Income Prediction Project: US Census Data Analysis¶

1. Understanding the Problem and Data¶

This project focuses on building a machine learning model to predict whether an individual earns more or less than $50,000 per year based on census data. We're working with a binary classification problem using data from the US Census that contains information for about 300,000 individuals.

Let's start by importing the necessary libraries and loading our data.

InĀ [Ā ]:
# !pip list
Package                 Version
----------------------- -----------
asttokens               3.0.0
cloudpickle             3.1.1
colorama                0.4.6
comm                    0.2.2
contourpy               1.3.1
cycler                  0.12.1
debugpy                 1.8.13
decorator               5.2.1
executing               2.2.0
fonttools               4.56.0
ipykernel               6.29.5
ipython                 9.0.2
ipython_pygments_lexers 1.1.1
jedi                    0.19.2
joblib                  1.4.2
jupyter_client          8.6.3
jupyter_core            5.7.2
kiwisolver              1.4.8
llvmlite                0.44.0
matplotlib              3.10.1
matplotlib-inline       0.1.7
nest-asyncio            1.6.0
numba                   0.61.0
numpy                   2.1.3
packaging               24.2
pandas                  2.2.3
parso                   0.8.4
pillow                  11.1.0
pip                     24.2
platformdirs            4.3.6
prompt_toolkit          3.0.50
psutil                  7.0.0
pure_eval               0.2.3
Pygments                2.19.1
pyparsing               3.2.1
python-dateutil         2.9.0.post0
pytz                    2025.1
pywin32                 309
pyzmq                   26.3.0
scikit-learn            1.6.1
scipy                   1.15.2
seaborn                 0.13.2
shap                    0.47.0
six                     1.17.0
slicer                  0.0.8
stack-data              0.6.3
threadpoolctl           3.5.0
tornado                 6.4.2
tqdm                    4.67.1
traitlets               5.14.3
typing_extensions       4.12.2
tzdata                  2025.1
wcwidth                 0.2.13
xgboost                 2.1.4

Packages used for Virutal Environment¶

InĀ [Ā ]:
# Package                 Version
# ----------------------- -----------
# asttokens               3.0.0
# cloudpickle             3.1.1
# colorama                0.4.6
# comm                    0.2.2
# contourpy               1.3.1
# cycler                  0.12.1
# debugpy                 1.8.13
# decorator               5.2.1
# executing               2.2.0
# fonttools               4.56.0
# ipykernel               6.29.5
# ipython                 9.0.2
# ipython_pygments_lexers 1.1.1
# jedi                    0.19.2
# joblib                  1.4.2
# jupyter_client          8.6.3
# jupyter_core            5.7.2
# kiwisolver              1.4.8
# llvmlite                0.44.0
# matplotlib              3.10.1
# matplotlib-inline       0.1.7
# nest-asyncio            1.6.0
# numba                   0.61.0
# numpy                   2.1.3
# packaging               24.2
# pandas                  2.2.3
# parso                   0.8.4
# pillow                  11.1.0
# pip                     24.2
# platformdirs            4.3.6
# prompt_toolkit          3.0.50
# psutil                  7.0.0
# pure_eval               0.2.3
# Pygments                2.19.1
# pyparsing               3.2.1
# python-dateutil         2.9.0.post0
# pytz                    2025.1
# pywin32                 309
# pyzmq                   26.3.0
# scikit-learn            1.6.1
# scipy                   1.15.2
# seaborn                 0.13.2
# shap                    0.47.0
# six                     1.17.0
# slicer                  0.0.8
# stack-data              0.6.3
# threadpoolctl           3.5.0
# tornado                 6.4.2
# tqdm                    4.67.1
# traitlets               5.14.3
# typing_extensions       4.12.2
# tzdata                  2025.1
# wcwidth                 0.2.13
# xgboost                 2.1.4
InĀ [49]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, auc, log_loss, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
import xgboost as xgb
import time
import pickle
import os
import warnings
warnings.filterwarnings('ignore')

# Set plot style
plt.style.use('ggplot')
InĀ [50]:
#Increase the dispaly size of outpus and dataframes
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

Let's load the training and test data:

InĀ [51]:
# Load data from CSV files
train_data  = pd.read_csv('C:\Important Files\Code and Software\Python Projects\DataIku\Data\census_income_learn.csv', header=None)
test_data  = pd.read_csv('C:\Important Files\Code and Software\Python Projects\DataIku\Data\census_income_test.csv', header=None)

print(test_data.shape, train_data.shape)


combined_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)

print("Combined data shape:", combined_data.shape)
train_data.head()
(99762, 42) (199523, 42)
Combined data shape: (299285, 42)
Out[51]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
0 73 Not in universe 0 0 High school graduate 0 Not in universe Widowed Not in universe or children Not in universe White All other Female Not in universe Not in universe Not in labor force 0 0 0 Nonfiler Not in universe Not in universe Other Rel 18+ ever marr not in subfamily Other relative of householder 1700.09 ? ? ? Not in universe under 1 year old ? 0 Not in universe United-States United-States United-States Native- Born in the United States 0 Not in universe 2 0 95 - 50000.
1 58 Self-employed-not incorporated 4 34 Some college but no degree 0 Not in universe Divorced Construction Precision production craft & repair White All other Male Not in universe Not in universe Children or Armed Forces 0 0 0 Head of household South Arkansas Householder Householder 1053.55 MSA to MSA Same county Same county No Yes 1 Not in universe United-States United-States United-States Native- Born in the United States 0 Not in universe 2 52 94 - 50000.
2 18 Not in universe 0 0 10th grade 0 High school Never married Not in universe or children Not in universe Asian or Pacific Islander All other Female Not in universe Not in universe Not in labor force 0 0 0 Nonfiler Not in universe Not in universe Child 18+ never marr Not in a subfamily Child 18 or older 991.95 ? ? ? Not in universe under 1 year old ? 0 Not in universe Vietnam Vietnam Vietnam Foreign born- Not a citizen of U S 0 Not in universe 2 0 95 - 50000.
3 9 Not in universe 0 0 Children 0 Not in universe Never married Not in universe or children Not in universe White All other Female Not in universe Not in universe Children or Armed Forces 0 0 0 Nonfiler Not in universe Not in universe Child <18 never marr not in subfamily Child under 18 never married 1758.14 Nonmover Nonmover Nonmover Yes Not in universe 0 Both parents present United-States United-States United-States Native- Born in the United States 0 Not in universe 0 0 94 - 50000.
4 10 Not in universe 0 0 Children 0 Not in universe Never married Not in universe or children Not in universe White All other Female Not in universe Not in universe Children or Armed Forces 0 0 0 Nonfiler Not in universe Not in universe Child <18 never marr not in subfamily Child under 18 never married 1069.16 Nonmover Nonmover Nonmover Yes Not in universe 0 Both parents present United-States United-States United-States Native- Born in the United States 0 Not in universe 0 0 94 - 50000.

Based on the metadata file, let's assign meaningful column names to our data:

InĀ [52]:
# Column names from metadata
column_names = [
    'age', 'class_of_worker', 'industry_code', 'occupation_code', 'education',
    'wage_per_hour', 'enrolled_in_edu_inst', 'marital_status', 'major_industry_code',
    'major_occupation_code', 'race', 'hispanic_origin', 'sex', 'member_of_labor_union',
    'reason_for_unemployment', 'full_or_part_time_employment', 'capital_gains',
    'capital_losses', 'dividends_from_stocks', 'tax_filer_status', 'region_of_previous_residence',
    'state_of_previous_residence', 'detailed_household_summary', 'detailed_household_summary_in_household',
    'instance_weight', 'migration_code_change_in_msa', 'migration_code_change_in_reg',
    'migration_code_move_within_reg', 'live_in_this_house_1_year_ago', 'migration_prev_res_in_sunbelt',
    'num_persons_worked_for_employer', 'family_members_under_18', 'country_of_birth_father',
    'country_of_birth_mother', 'country_of_birth_self', 'citizenship', 'own_business_or_self_employed',
    'fill_inc_questionnaire_for_veteran', 'veterans_benefits', 'weeks_worked_in_year', 'year', 'income'
]

# Apply column names
train_data.columns = column_names
test_data.columns = column_names
combined_data.columns = column_names

# Check the income target distribution
combined_data['income'].value_counts()
Out[52]:
income
- 50000.    280717
50000+.      18568
Name: count, dtype: int64

2. Data Check¶

Let's analyze the data to understand its structure, missing values, and data types.

InĀ [53]:
# Check data types and missing data
def analyze_dataframe(df):
    # Replace blank values with NaN
    df = df.replace(r'^\s*$', np.nan, regex=True)
    
    # Get the data type for each column
    data_types = df.dtypes
    
    # Get the number of missing values for each column
    missing_values = df.isnull().sum()
    
    # Calculate the percentage of missing values for each column
    missing_percentage = (missing_values / len(df)) * 100
    
    # Get the number of unique values for each column
    unique_values = df.nunique()
    
    # Combine all the information into a single DataFrame
    analysis_df = pd.DataFrame({
        'Data Type': data_types,
        'Missing Values': missing_values,
        '% Missing': missing_percentage,
        'Unique Values': unique_values
    })
    
    return analysis_df

def count_data_types(df):
    # Get the data type for each column and count occurrences of each type
    data_type_counts = df.dtypes.value_counts()
    return data_type_counts

# Analyze the DataFrame
analysis_result = analyze_dataframe(combined_data)

# Print the result
print("Data Analysis:")
print(analysis_result)

# Count data types
data_type_counts = count_data_types(combined_data)

# Print the data type counts
print("\nData Type Counts:")
print(data_type_counts)
Data Analysis:
                                        Data Type  Missing Values  % Missing  Unique Values
age                                         int64               0        0.0             91
class_of_worker                            object               0        0.0              9
industry_code                               int64               0        0.0             52
occupation_code                             int64               0        0.0             47
education                                  object               0        0.0             17
wage_per_hour                               int64               0        0.0           1425
enrolled_in_edu_inst                       object               0        0.0              3
marital_status                             object               0        0.0              7
major_industry_code                        object               0        0.0             24
major_occupation_code                      object               0        0.0             15
race                                       object               0        0.0              5
hispanic_origin                            object               0        0.0             10
sex                                        object               0        0.0              2
member_of_labor_union                      object               0        0.0              3
reason_for_unemployment                    object               0        0.0              6
full_or_part_time_employment               object               0        0.0              8
capital_gains                               int64               0        0.0            133
capital_losses                              int64               0        0.0            114
dividends_from_stocks                       int64               0        0.0           1675
tax_filer_status                           object               0        0.0              6
region_of_previous_residence               object               0        0.0              6
state_of_previous_residence                object               0        0.0             51
detailed_household_summary                 object               0        0.0             38
detailed_household_summary_in_household    object               0        0.0              8
instance_weight                           float64               0        0.0         123232
migration_code_change_in_msa               object               0        0.0             10
migration_code_change_in_reg               object               0        0.0              9
migration_code_move_within_reg             object               0        0.0             10
live_in_this_house_1_year_ago              object               0        0.0              3
migration_prev_res_in_sunbelt              object               0        0.0              4
num_persons_worked_for_employer             int64               0        0.0              7
family_members_under_18                    object               0        0.0              5
country_of_birth_father                    object               0        0.0             43
country_of_birth_mother                    object               0        0.0             43
country_of_birth_self                      object               0        0.0             43
citizenship                                object               0        0.0              5
own_business_or_self_employed               int64               0        0.0              3
fill_inc_questionnaire_for_veteran         object               0        0.0              3
veterans_benefits                           int64               0        0.0              3
weeks_worked_in_year                        int64               0        0.0             53
year                                        int64               0        0.0              2
income                                     object               0        0.0              2

Data Type Counts:
object     29
int64      12
float64     1
Name: count, dtype: int64

Let's check for duplicate rows in our data and drop them consistently across sets:

We need to track dropped rows to ensure consistency between train and test sets.

InĀ [54]:
# Function to track and consistently drop duplicates across datasets
def track_and_drop_duplicates(combined_df, train_df, test_df):
    # Get initial shapes
    print(f"Before: Combined data shape: {combined_df.shape}")
    print(f"Before: Train data shape: {train_df.shape}")
    print(f"Before: Test data shape: {test_df.shape}")
    
    # Check initial duplicates
    duplicates_count = combined_df.duplicated().sum()
    print(f"Number of duplicate rows in combined dataset: {duplicates_count}")
    
    # Get the indices of the duplicate rows in the combined dataset
    duplicate_mask = combined_df.duplicated(keep='first')
    duplicate_indices = combined_df[duplicate_mask].index.tolist()
    
    # Split these indices into train and test
    train_size = train_df.shape[0]
    train_duplicate_indices = [idx for idx in duplicate_indices if idx < train_size]
    test_duplicate_indices = [idx - train_size for idx in duplicate_indices if idx >= train_size]
    
    print(f"Duplicates in train: {len(train_duplicate_indices)}")
    print(f"Duplicates in test: {len(test_duplicate_indices)}")
    
    # Drop duplicates from all datasets
    combined_df_clean = combined_df.drop_duplicates(keep='first').reset_index(drop=True)
    train_df_clean = train_df.drop(index=train_duplicate_indices, errors='ignore').reset_index(drop=True)
    test_df_clean = test_df.drop(index=test_duplicate_indices, errors='ignore').reset_index(drop=True)
    
    # Cross-check consistency
    print(f"After: Combined data shape: {combined_df_clean.shape}")
    print(f"After: Train data shape: {train_df_clean.shape}")
    print(f"After: Test data shape: {test_df_clean.shape}")
    print(f"Sum of train + test: {train_df_clean.shape[0] + test_df_clean.shape[0]}")
    
    return combined_df_clean, train_df_clean, test_df_clean

# Apply the function
combined_data_clean, train_data_clean, test_data_clean = track_and_drop_duplicates(
    combined_data, train_data, test_data
)
Before: Combined data shape: (299285, 42)
Before: Train data shape: (199523, 42)
Before: Test data shape: (99762, 42)
Number of duplicate rows in combined dataset: 6735
Duplicates in train: 3229
Duplicates in test: 3506
After: Combined data shape: (292550, 42)
After: Train data shape: (196294, 42)
After: Test data shape: (96256, 42)
Sum of train + test: 292550

3. Exploratory Data Analysis (EDA)¶

3.1 Target Variable Distribution¶

First, let's examine the distribution of our target variable (income).

InĀ [79]:
# Target variable distribution - Fixed version
plt.figure(figsize=(10, 6))
income_counts = combined_data['income'].value_counts()
plt.bar(income_counts.index, income_counts.values, color=['steelblue', 'coral'])
plt.title('Distribution of Income Level', fontsize=15)
plt.xlabel('Income Level', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)

# Add percentage labels
total = income_counts.sum()
for i, count in enumerate(income_counts.values):
    percentage = count / total * 100
    plt.text(i, count, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=12)

plt.tight_layout()
plt.show()

# Print the exact numbers - using actual values in the dataset
print(f"Income distribution:\n{income_counts}")
# Check actual values in the dataset
print(f"Unique income values: {combined_data['income'].unique()}")
# Calculate percentages using the actual values in the dataset
for income_level in combined_data['income'].unique():
    percentage = (combined_data['income'] == income_level).mean() * 100
    print(f"Percentage {income_level}: {percentage:.2f}%")
No description has been provided for this image
Income distribution:
income
- 50000.    280717
50000+.      18568
Name: count, dtype: int64
Unique income values: [' - 50000.' ' 50000+.']
Percentage  - 50000.: 93.80%
Percentage  50000+.: 6.20%

3.2 Numerical Features Analysis¶

Let's examine the distributions of numerical features and their relationships with income.

InĀ [80]:
# Identify numerical columns based on metadata (7 continuous variables)
numerical_columns = ['age', 'wage_per_hour', 'capital_gains', 'capital_losses', 
                     'dividends_from_stocks', 'num_persons_worked_for_employer', 'weeks_worked_in_year']

# Create histograms for numerical features
plt.figure(figsize=(20, 15))

for i, col in enumerate(numerical_columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(combined_data[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.tight_layout()

plt.show()

# Box plots for numerical features by income
plt.figure(figsize=(20, 15))

for i, col in enumerate(numerical_columns):
    plt.subplot(3, 3, i+1)
    sns.boxplot(x='income', y=col, data=combined_data)
    plt.title(f'{col} by Income Level')
    plt.tight_layout()

plt.show()

# Correlation matrix for numerical features
plt.figure(figsize=(12, 10))
correlation_matrix = combined_data[numerical_columns].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features', fontsize=15)
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

3.3 Categorical Features Analysis¶

Now let's examine some key categorical features and their relationships with income.

InĀ [81]:
# Select important categorical features for analysis
categorical_columns = [
    'class_of_worker',
    'industry_code',
    'occupation_code',
    'education',
    'enrolled_in_edu_inst',
    'marital_status',
    'major_industry_code',
    'major_occupation_code',
    'race',
    'hispanic_origin',
    'sex',
    'member_of_labor_union',
    'reason_for_unemployment',
    'full_or_part_time_employment',
    'tax_filer_status',
    'region_of_previous_residence',
    'state_of_previous_residence',
    'detailed_household_summary',
    'detailed_household_summary_in_household',
    'migration_code_change_in_msa',
    'migration_code_change_in_reg',
    'migration_code_move_within_reg',
    'live_in_this_house_1_year_ago',
    'migration_prev_res_in_sunbelt',
    'family_members_under_18',
    'country_of_birth_father',
    'country_of_birth_mother',
    'country_of_birth_self',
    'citizenship',
    'own_business_or_self_employed',
    'fill_inc_questionnaire_for_veteran',
    'veterans_benefits',
    'year'
]

# Create count plots for categorical features
for col in categorical_columns:
    plt.figure(figsize=(12, 6))
    
    # Get value counts for the feature
    value_counts = combined_data[col].value_counts().nlargest(10)
    
    # Plot the 10 most common values
    sns.countplot(y=col, data=combined_data, order=value_counts.index)
    plt.title(f'Distribution of {col} (Top 10 Categories)', fontsize=15)
    plt.tight_layout()
    plt.show()
    
    # Stacked bar chart showing income distribution by category
    plt.figure(figsize=(14, 8))
    
    # Prepare data for stacked bar chart
    cross_tab = pd.crosstab(
        combined_data[col], 
        combined_data['income'],
        normalize='index'
    ) * 100  # Convert to percentage
    
    # Plot only the top 10 categories
    cross_tab.loc[value_counts.index].plot(kind='barh', stacked=True, 
                                          colormap='coolwarm')
    
    plt.title(f'Income Distribution by {col} (Top 10 Categories)', fontsize=15)
    plt.xlabel('Percentage', fontsize=12)
    plt.tight_layout()
    plt.show()
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image
No description has been provided for this image
<Figure size 1400x800 with 0 Axes>
No description has been provided for this image

4. Data Preparation¶

Now that we've explored the data, let's prepare it for modeling.

4.1 Data Cleaning¶

InĀ [58]:
# Function to clean data - fixed with correct values
def clean_data(df):
    # Make a copy of the dataframe
    df_clean = df.copy()
    
    # Convert target to binary with the correct values (note the leading space)
    df_clean['income'] = df_clean['income'].map({' - 50000.': 0, ' 50000+.': 1})
    
    # Replace '?' values with np.nan
    df_clean = df_clean.replace('?', np.nan)
    
    return df_clean

# Clean the training and test data
train_data_clean = clean_data(train_data_clean)
test_data_clean = clean_data(test_data_clean)
combined_data_clean = clean_data(combined_data_clean)

# Check missing values after cleaning
train_missing = train_data_clean.isnull().sum()[train_data_clean.isnull().sum() > 0]
print("Missing values in training data:")
print(train_missing)
Missing values in training data:
Series([], dtype: int64)

4.2 Identify Categorical Features and Their Unique Values¶

InĀ [59]:
categorical_columns = [
    'class_of_worker', 'industry_code', 'occupation_code', 'education',
    'enrolled_in_edu_inst', 'marital_status', 'major_industry_code',
    'major_occupation_code', 'race', 'hispanic_origin', 'sex',
    'member_of_labor_union', 'reason_for_unemployment',
    'full_or_part_time_employment', 'tax_filer_status',
    'region_of_previous_residence', 'state_of_previous_residence',
    'detailed_household_summary', 'detailed_household_summary_in_household',
    'migration_code_change_in_msa', 'migration_code_change_in_reg',
    'migration_code_move_within_reg', 'live_in_this_house_1_year_ago',
    'migration_prev_res_in_sunbelt', 'family_members_under_18',
    'country_of_birth_father', 'country_of_birth_mother',
    'country_of_birth_self', 'citizenship', 'own_business_or_self_employed',
    'fill_inc_questionnaire_for_veteran', 'veterans_benefits', 'year'
]

# Function to identify categorical features and their unique values
def identify_categorical_features(df):
    # Identify categorical columns   
    print(f"Identified {len(categorical_columns)} categorical features")
    
    # Analyze unique values for each categorical column
    for col in categorical_columns:
        n_unique = df[col].nunique()
        unique_values = df[col].unique()
        print(f"{col}: {n_unique} unique values")
        print(f"Unique values: {unique_values}")
        print("-" * 50)
    
    return categorical_columns

# Apply the function to see unique values
categorical_columns = identify_categorical_features(combined_data_clean)
Identified 33 categorical features
class_of_worker: 9 unique values
Unique values: [' Not in universe' ' Self-employed-not incorporated' ' Private'
 ' Local government' ' Federal government' ' Self-employed-incorporated'
 ' State government' ' Never worked' ' Without pay']
--------------------------------------------------
industry_code: 52 unique values
Unique values: [ 0  4 40 34 43 37 24 39 12 35 45  3 19 29 32 48 33 23 44 36 31 30 41  5
 11  9 42  6 18 50  2  1 26 47 16 14 22 17  7  8 25 46 27 15 13 49 38 21
 28 20 51 10]
--------------------------------------------------
occupation_code: 47 unique values
Unique values: [ 0 34 10  3 40 26 37 31 12 36 41 22  2 35 25 23 42  8 19 29 27 16 33 13
 18  9 17 39 32 11 30 38 20  7 21 44 24 43 28  4  1  6 45 14  5 15 46]
--------------------------------------------------
education: 17 unique values
Unique values: [' High school graduate' ' Some college but no degree' ' 10th grade'
 ' Children' ' Bachelors degree(BA AB BS)'
 ' Masters degree(MA MS MEng MEd MSW MBA)' ' Less than 1st grade'
 ' Associates degree-academic program' ' 7th and 8th grade'
 ' 12th grade no diploma' ' Associates degree-occup /vocational'
 ' Prof school degree (MD DDS DVM LLB JD)' ' 5th or 6th grade'
 ' 11th grade' ' Doctorate degree(PhD EdD)' ' 9th grade'
 ' 1st 2nd 3rd or 4th grade']
--------------------------------------------------
enrolled_in_edu_inst: 3 unique values
Unique values: [' Not in universe' ' High school' ' College or university']
--------------------------------------------------
marital_status: 7 unique values
Unique values: [' Widowed' ' Divorced' ' Never married'
 ' Married-civilian spouse present' ' Separated' ' Married-spouse absent'
 ' Married-A F spouse present']
--------------------------------------------------
major_industry_code: 24 unique values
Unique values: [' Not in universe or children' ' Construction' ' Entertainment'
 ' Finance insurance and real estate' ' Education'
 ' Business and repair services' ' Manufacturing-nondurable goods'
 ' Personal services except private HH' ' Manufacturing-durable goods'
 ' Other professional services' ' Mining' ' Transportation'
 ' Wholesale trade' ' Public administration' ' Retail trade'
 ' Social services' ' Private household services'
 ' Utilities and sanitary services' ' Communications' ' Hospital services'
 ' Medical except hospital' ' Agriculture' ' Forestry and fisheries'
 ' Armed Forces']
--------------------------------------------------
major_occupation_code: 15 unique values
Unique values: [' Not in universe' ' Precision production craft & repair'
 ' Professional specialty' ' Executive admin and managerial'
 ' Handlers equip cleaners etc ' ' Adm support including clerical'
 ' Machine operators assmblrs & inspctrs' ' Other service' ' Sales'
 ' Private household services' ' Technicians and related support'
 ' Transportation and material moving' ' Farming forestry and fishing'
 ' Protective services' ' Armed Forces']
--------------------------------------------------
race: 5 unique values
Unique values: [' White' ' Asian or Pacific Islander' ' Amer Indian Aleut or Eskimo'
 ' Black' ' Other']
--------------------------------------------------
hispanic_origin: 10 unique values
Unique values: [' All other' ' Do not know' ' Central or South American'
 ' Mexican (Mexicano)' ' Mexican-American' ' Other Spanish'
 ' Puerto Rican' ' Cuban' ' Chicano' ' NA']
--------------------------------------------------
sex: 2 unique values
Unique values: [' Female' ' Male']
--------------------------------------------------
member_of_labor_union: 3 unique values
Unique values: [' Not in universe' ' No' ' Yes']
--------------------------------------------------
reason_for_unemployment: 6 unique values
Unique values: [' Not in universe' ' Job loser - on layoff' ' Other job loser'
 ' New entrant' ' Re-entrant' ' Job leaver']
--------------------------------------------------
full_or_part_time_employment: 8 unique values
Unique values: [' Not in labor force' ' Children or Armed Forces' ' Full-time schedules'
 ' Unemployed full-time' ' Unemployed part- time'
 ' PT for non-econ reasons usually FT' ' PT for econ reasons usually PT'
 ' PT for econ reasons usually FT']
--------------------------------------------------
tax_filer_status: 6 unique values
Unique values: [' Nonfiler' ' Head of household' ' Joint both under 65' ' Single'
 ' Joint both 65+' ' Joint one under 65 & one 65+']
--------------------------------------------------
region_of_previous_residence: 6 unique values
Unique values: [' Not in universe' ' South' ' Northeast' ' Midwest' ' West' ' Abroad']
--------------------------------------------------
state_of_previous_residence: 51 unique values
Unique values: [' Not in universe' ' Arkansas' ' Utah' ' Michigan' ' Minnesota' ' Alaska'
 ' Kansas' ' Indiana' ' ?' ' Massachusetts' ' New Mexico' ' Nevada'
 ' Tennessee' ' Colorado' ' Abroad' ' Kentucky' ' California' ' Arizona'
 ' North Carolina' ' Connecticut' ' Florida' ' Vermont' ' Maryland'
 ' Oklahoma' ' Oregon' ' Ohio' ' South Carolina' ' Texas' ' Montana'
 ' Wyoming' ' Georgia' ' Pennsylvania' ' Iowa' ' New Hampshire'
 ' Missouri' ' Alabama' ' North Dakota' ' New Jersey' ' Louisiana'
 ' West Virginia' ' Delaware' ' Illinois' ' Maine' ' Wisconsin'
 ' New York' ' Idaho' ' District of Columbia' ' South Dakota' ' Nebraska'
 ' Virginia' ' Mississippi']
--------------------------------------------------
detailed_household_summary: 38 unique values
Unique values: [' Other Rel 18+ ever marr not in subfamily' ' Householder'
 ' Child 18+ never marr Not in a subfamily'
 ' Child <18 never marr not in subfamily' ' Spouse of householder'
 ' Secondary individual' ' Other Rel 18+ never marr not in subfamily'
 ' Nonfamily householder' ' Grandchild <18 never marr not in subfamily'
 ' Grandchild <18 never marr child of subfamily RP'
 ' Child 18+ ever marr Not in a subfamily'
 ' Child 18+ never marr RP of subfamily'
 ' Child 18+ spouse of subfamily RP'
 ' Other Rel <18 never marr child of subfamily RP'
 ' Child under 18 of RP of unrel subfamily'
 ' Grandchild 18+ never marr not in subfamily'
 ' Child 18+ ever marr RP of subfamily'
 ' Other Rel 18+ ever marr RP of subfamily' ' RP of unrelated subfamily'
 ' Other Rel 18+ spouse of subfamily RP'
 ' Other Rel <18 never marr not in subfamily'
 ' Other Rel <18 spouse of subfamily RP' ' In group quarters'
 ' Grandchild 18+ spouse of subfamily RP'
 ' Other Rel 18+ never marr RP of subfamily'
 ' Child <18 never marr RP of subfamily'
 ' Child <18 ever marr not in subfamily'
 ' Other Rel <18 ever marr RP of subfamily'
 ' Grandchild 18+ ever marr not in subfamily'
 ' Child <18 spouse of subfamily RP'
 ' Spouse of RP of unrelated subfamily'
 ' Other Rel <18 never married RP of subfamily'
 ' Grandchild 18+ never marr RP of subfamily'
 ' Grandchild 18+ ever marr RP of subfamily'
 ' Child <18 ever marr RP of subfamily'
 ' Other Rel <18 ever marr not in subfamily'
 ' Grandchild <18 never marr RP of subfamily'
 ' Grandchild <18 ever marr not in subfamily']
--------------------------------------------------
detailed_household_summary_in_household: 8 unique values
Unique values: [' Other relative of householder' ' Householder' ' Child 18 or older'
 ' Child under 18 never married' ' Spouse of householder'
 ' Nonrelative of householder' ' Group Quarters- Secondary individual'
 ' Child under 18 ever married']
--------------------------------------------------
migration_code_change_in_msa: 10 unique values
Unique values: [' ?' ' MSA to MSA' ' Nonmover' ' NonMSA to nonMSA' ' Not in universe'
 ' Not identifiable' ' Abroad to MSA' ' MSA to nonMSA' ' Abroad to nonMSA'
 ' NonMSA to MSA']
--------------------------------------------------
migration_code_change_in_reg: 9 unique values
Unique values: [' ?' ' Same county' ' Nonmover' ' Different region'
 ' Different county same state' ' Not in universe'
 ' Different division same region' ' Abroad'
 ' Different state same division']
--------------------------------------------------
migration_code_move_within_reg: 10 unique values
Unique values: [' ?' ' Same county' ' Nonmover' ' Different state in South'
 ' Different county same state' ' Not in universe'
 ' Different state in Northeast' ' Abroad' ' Different state in Midwest'
 ' Different state in West']
--------------------------------------------------
live_in_this_house_1_year_ago: 3 unique values
Unique values: [' Not in universe under 1 year old' ' No' ' Yes']
--------------------------------------------------
migration_prev_res_in_sunbelt: 4 unique values
Unique values: [' ?' ' Yes' ' Not in universe' ' No']
--------------------------------------------------
family_members_under_18: 5 unique values
Unique values: [' Not in universe' ' Both parents present' ' Mother only present'
 ' Neither parent present' ' Father only present']
--------------------------------------------------
country_of_birth_father: 43 unique values
Unique values: [' United-States' ' Vietnam' ' Philippines' ' ?' ' Columbia' ' Germany'
 ' Mexico' ' Japan' ' Peru' ' Dominican-Republic' ' South Korea' ' Cuba'
 ' El-Salvador' ' Canada' ' Scotland' ' Outlying-U S (Guam USVI etc)'
 ' Italy' ' Guatemala' ' Ecuador' ' Puerto-Rico' ' Cambodia' ' China'
 ' Poland' ' Nicaragua' ' Taiwan' ' England' ' Ireland' ' Hungary'
 ' Yugoslavia' ' Trinadad&Tobago' ' Jamaica' ' Honduras' ' Portugal'
 ' Iran' ' France' ' India' ' Hong Kong' ' Haiti' ' Greece'
 ' Holand-Netherlands' ' Thailand' ' Laos' ' Panama']
--------------------------------------------------
country_of_birth_mother: 43 unique values
Unique values: [' United-States' ' Vietnam' ' ?' ' Columbia' ' Mexico' ' El-Salvador'
 ' Peru' ' Puerto-Rico' ' Cuba' ' Philippines' ' Dominican-Republic'
 ' Germany' ' England' ' Guatemala' ' Scotland' ' Portugal' ' Italy'
 ' Ecuador' ' Yugoslavia' ' China' ' Poland' ' Hungary' ' Nicaragua'
 ' Taiwan' ' Ireland' ' Canada' ' South Korea' ' Trinadad&Tobago'
 ' Jamaica' ' Honduras' ' Iran' ' France' ' Cambodia' ' India'
 ' Hong Kong' ' Haiti' ' Japan' ' Greece' ' Holand-Netherlands'
 ' Thailand' ' Panama' ' Laos' ' Outlying-U S (Guam USVI etc)']
--------------------------------------------------
country_of_birth_self: 43 unique values
Unique values: [' United-States' ' Vietnam' ' ?' ' Columbia' ' Mexico' ' Peru' ' Cuba'
 ' Philippines' ' Dominican-Republic' ' El-Salvador' ' Canada' ' Scotland'
 ' Portugal' ' Guatemala' ' Ecuador' ' Germany'
 ' Outlying-U S (Guam USVI etc)' ' Puerto-Rico' ' Italy' ' China'
 ' Poland' ' Nicaragua' ' Taiwan' ' England' ' Ireland' ' South Korea'
 ' Trinadad&Tobago' ' Jamaica' ' Honduras' ' Iran' ' Hungary' ' France'
 ' Cambodia' ' India' ' Hong Kong' ' Japan' ' Haiti' ' Holand-Netherlands'
 ' Greece' ' Thailand' ' Panama' ' Yugoslavia' ' Laos']
--------------------------------------------------
citizenship: 5 unique values
Unique values: [' Native- Born in the United States'
 ' Foreign born- Not a citizen of U S '
 ' Foreign born- U S citizen by naturalization'
 ' Native- Born abroad of American Parent(s)'
 ' Native- Born in Puerto Rico or U S Outlying']
--------------------------------------------------
own_business_or_self_employed: 3 unique values
Unique values: [0 2 1]
--------------------------------------------------
fill_inc_questionnaire_for_veteran: 3 unique values
Unique values: [' Not in universe' ' No' ' Yes']
--------------------------------------------------
veterans_benefits: 3 unique values
Unique values: [2 0 1]
--------------------------------------------------
year: 2 unique values
Unique values: [95 94]
--------------------------------------------------

4.3 Feature Engineering¶

InĀ [60]:
# Enhanced feature engineering function with integrated label encoding
def engineer_features(df):
    df = df.copy()
       
    # Create work experience feature (assuming people start working at age 18)
    df['work_experience'] = df['age'] - 18
    df.loc[df['work_experience'] < 0, 'work_experience'] = 0
    
    # Create a feature for capital gains/losses ratio
    df['capital_ratio'] = df['capital_gains'] / (df['capital_losses'] + 1)  # Adding 1 to avoid division by zero
      
    # Binary feature for full year worker
    df['full_year_worker'] = (df['weeks_worked_in_year'] >= 50).astype(int)
    
    # Create binary features for capital gains/losses and dividends
    df['has_capital_gains'] = (df['capital_gains'] > 7000).astype(int)
    df['has_capital_losses'] = (df['capital_losses'] > 0).astype(int)
    df['has_dividends'] = (df['dividends_from_stocks'] > 0).astype(int)
     
    # Marital status simplified
    df['is_married'] = df['marital_status'].str.contains('Married', case=False).astype(int)
    return df

# Apply feature engineering
train_data_fe = engineer_features(train_data_clean)
test_data_fe = engineer_features(test_data_clean)
combined_data_fe = engineer_features(combined_data_clean)

# Check the new features
print("Engineered features added. New dataframe shape:", train_data_fe.shape)
Engineered features added. New dataframe shape: (196294, 49)

4.4 Encode Categorical Variables¶

InĀ [61]:
def encode_categorical(df):
    df = df.copy()
    label_encoded_columns = []  # Track columns that were label encoded

    # 1. Label Encoding
    def label_encode(df):
        # Define mappings
        education_mapping = {
            ' Less than 1st grade': 0, ' 1st 2nd 3rd or 4th grade': 1,
            ' 5th or 6th grade': 2, ' 7th and 8th grade': 3,
            ' 9th grade': 4, ' 10th grade': 5, ' 11th grade': 6,
            ' 12th grade no diploma': 7, ' High school graduate': 8,
            ' Some college but no degree': 9, ' Associates degree-occup /vocational': 10,
            ' Associates degree-academic program': 11, ' Bachelors degree(BA AB BS)': 12,
            ' Masters degree(MA MS MEng MEd MSW MBA)': 13,
            ' Prof school degree (MD DDS DVM LLB JD)': 14,
            ' Doctorate degree(PhD EdD)': 15, ' Children': -1}
        
        # Company size categorical feature
        company_size_map = {
            'Not in universe': 0,
            'under 10': 1,
            '10 - 24': 2,
            '25 - 99': 3,
            '100 - 499': 4,
            '500 - 999': 5,
            '1000+': 6
        }
        
        simple_mapping = {
            ' Not in universe': 0, ' No': 1, ' Yes': 2
        }
        
        enrolled_mapping = {
            ' Not in universe': 0, ' High school': 1, ' College or university': 2}
        
        live_in_house_mapping = {
            ' Not in universe under 1 year old': 0, ' No': 1, ' Yes': 2}

        # Apply mappings
        label_encode_cols = {
            'sex': {' Female': 0, ' Male': 1},
            'education': education_mapping,
            'enrolled_in_edu_inst': enrolled_mapping,
            'member_of_labor_union': simple_mapping,
            'live_in_this_house_1_year_ago': live_in_house_mapping,
            'fill_inc_questionnaire_for_veteran': simple_mapping,
            'num_persons_worked_for_employer': company_size_map
        }
        
        # Create new columns with encoded values
        encoded_columns = []
        for col, mapping in label_encode_cols.items():
            if col in df.columns:
                new_col_name = f"{col}_encoded"
                df[new_col_name] = df[col].map(mapping)
                encoded_columns.append(col)
        
        return df, encoded_columns, list(label_encode_cols.keys())

    # Apply label encoding and get list of columns that were encoded
    df, label_encoded_columns, label_encode_keys = label_encode(df)

    # 2. One-Hot Encoding
    # Identify categorical columns, excluding 'income'
    categorical_cols = [col for col in df.columns if 
                        df[col].dtype == 'object' and col != 'income' and 
                        not col.endswith('_encoded')]

    # One-hot encode remaining categorical columns
    if categorical_cols:
        encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        encoded = encoder.fit_transform(df[categorical_cols])
        encoded_df = pd.DataFrame(
            encoded, 
            columns=encoder.get_feature_names_out(categorical_cols),
            index=df.index
        )
        df = pd.concat([df, encoded_df], axis=1)
    
    # 3. Numeric Conversion
    numeric_cols = ['occupation_code', 'industry_code', 'year', 'veterans_benefits', 'own_business_or_self_employed']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
    # 4. Drop original categorical columns
    cols_to_drop = categorical_cols + label_encode_keys
    df = df.drop(columns=cols_to_drop, errors='ignore')

    return df

# Encode all datasets
train_data_enc = encode_categorical(train_data_fe)
test_data_enc = encode_categorical(test_data_fe)
combined_data_enc = encode_categorical(combined_data_fe)

print("Engineered features added. Train dataframe shape:", train_data_enc.shape)
print("Engineered features added. Test dataframe shape:", test_data_enc.shape)
print("Engineered features added. Combined dataframe shape:", combined_data_enc.shape)
Engineered features added. Train dataframe shape: (196294, 423)
Engineered features added. Test dataframe shape: (96256, 422)
Engineered features added. Combined dataframe shape: (292550, 423)

4.4.1 Drop new columns not common in both train and test¶

InĀ [62]:
def align_columns(train_df, test_df):
    """
    Ensures that the training and test DataFrames have the same columns.

    Args:
        train_df (pd.DataFrame): The training DataFrame.
        test_df (pd.DataFrame): The test DataFrame.

    Returns:
        tuple: The modified training and test DataFrames with aligned columns.
    """

    train_cols = set(train_df.columns)
    test_cols = set(test_df.columns)

    # Find columns unique to train
    train_unique = train_cols - test_cols

    # Find columns unique to test
    test_unique = test_cols - train_cols

    # Print columns to drop from train
    if train_unique:
        print("Columns unique to training data:")
        for col in train_unique:
            print(f"  - {col}")
        train_df = train_df.drop(columns=train_unique, errors='ignore')

    # Print columns to drop from test
    if test_unique:
        print("\nColumns unique to test data:")
        for col in test_unique:
            print(f"  - {col}")
        test_df = test_df.drop(columns=test_unique, errors='ignore')

    return train_df, test_df

# Align columns between training and test sets
train_data_aligned, test_data_aligned = align_columns(train_data_enc, test_data_enc)

# Verify the shapes after alignment
print("\nTraining data shape after alignment:", train_data_aligned.shape)
print("Test data shape after alignment:", test_data_aligned.shape)
Columns unique to training data:
  - detailed_household_summary_ Grandchild <18 ever marr not in subfamily

Training data shape after alignment: (196294, 422)
Test data shape after alignment: (96256, 422)

4.4.2 Drop Features with High Null Percentage¶

InĀ [63]:
def drop_high_null_features(train_df, test_df, threshold=0.3):
    """
    Removes features that have a percentage of null values above the specified threshold.
    
    Args:
        train_df (pd.DataFrame): Training dataframe
        test_df (pd.DataFrame): Test dataframe
        threshold (float): Maximum allowed percentage of nulls (0.0 to 1.0)
        
    Returns:
        tuple: Clean training and test dataframes with high-null features removed
    """
    # Make copies to avoid modifying the originals
    train_df = train_df.copy()
    test_df = test_df.copy()
    
    # Calculate null percentages for each column in training data
    null_percentages = train_df.isnull().mean()
    
    # Identify columns to drop (excluding 'income')
    high_null_cols = null_percentages[
        (null_percentages > threshold) & 
        (null_percentages.index != 'income')
    ].index.tolist()
    
    # Log the columns being dropped
    if high_null_cols:
        print(f"\nRemoving {len(high_null_cols)} features with >{threshold*100:.1f}% null values:")
        for col in high_null_cols:
            print(f"  - {col}: {null_percentages[col]*100:.2f}% nulls")
        
        # Drop the identified columns from both datasets
        train_df = train_df.drop(columns=high_null_cols)
        test_df = test_df.drop(columns=high_null_cols)
        
        print(f"\nAfter dropping high-null features:")
        print(f"  Training shape: {train_df.shape}")
        print(f"  Test shape: {test_df.shape}")
    else:
        print(f"\nNo features exceeded the {threshold*100:.1f}% null threshold")
    
    return train_df, test_df

# Apply the function to remove high-null features
train_data_nulls_dropped, test_data_nulls_dropped = drop_high_null_features(
    train_data_aligned, test_data_aligned, threshold=0.3
)

# Continue with the clean datasets
# (Rename the variables to maintain naming consistency with subsequent steps)
train_data_aligned = train_data_nulls_dropped
test_data_aligned = test_data_nulls_dropped
Removing 1 features with >30.0% null values:
  - num_persons_worked_for_employer_encoded: 100.00% nulls

After dropping high-null features:
  Training shape: (196294, 421)
  Test shape: (96256, 421)

4.5 Split Data¶

InĀ [64]:
# Limit dataset size if needed
# train_data_aligned = train_data_aligned.iloc[:50000]
# test_data_aligned = test_data_aligned[:50000]
InĀ [65]:
# Split into train/test
X_train = train_data_aligned.drop('income', axis=1)
y_train = train_data_aligned['income']
X_test = test_data_aligned.drop('income', axis=1)
y_test = test_data_aligned['income']

# Create validation set
# Create validation set with stratification
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, 
                                                  random_state=42, stratify=y_train)

print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
(157035, 420) (157035,) (39259, 420) (39259,)

4.6 Handle Missing Values¶

InĀ [66]:
# Identify numeric and categorical features
num_features = X_train.select_dtypes(include=np.number).columns.tolist()
cat_features = X_train.select_dtypes(include='object').columns.tolist()

# Imputation
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

# Apply imputation to numeric features
if num_features:
    X_train[num_features] = num_imputer.fit_transform(X_train[num_features])
    X_val[num_features] = num_imputer.transform(X_val[num_features])
    X_test[num_features] = num_imputer.transform(X_test[num_features])

# Apply imputation to categorical features
if cat_features:
    X_train[cat_features] = cat_imputer.fit_transform(X_train[cat_features])
    X_val[cat_features] = cat_imputer.transform(X_val[cat_features])
    X_test[cat_features] = cat_imputer.transform(X_test[cat_features])

# Verify imputation was successful
train_missing = X_train.isnull().sum().sum()
val_missing = X_val.isnull().sum().sum()
test_missing = X_test.isnull().sum().sum()

print("Missing values after imputation:")
print(f"Training data: {train_missing}")
print(f"Validation data: {val_missing}")
print(f"Test data: {test_missing}")

# If any missing values remain, apply a second pass of imputation
if train_missing > 0 or val_missing > 0 or test_missing > 0:
    print("Warning: Some values couldn't be imputed, applying a fallback strategy")
    
    # Apply a more aggressive fallback imputation strategy
    fallback_imputer = SimpleImputer(strategy='constant', fill_value=0)
    
    # Get columns that still have missing values
    train_missing_cols = X_train.columns[X_train.isnull().any()].tolist()
    val_missing_cols = X_val.columns[X_val.isnull().any()].tolist()
    test_missing_cols = X_test.columns[X_test.isnull().any()].tolist()
    
    all_missing_cols = list(set(train_missing_cols + val_missing_cols + test_missing_cols))
    
    if all_missing_cols:
        X_train[all_missing_cols] = fallback_imputer.fit_transform(X_train[all_missing_cols])
        X_val[all_missing_cols] = fallback_imputer.transform(X_val[all_missing_cols])
        X_test[all_missing_cols] = fallback_imputer.transform(X_test[all_missing_cols])
    
    # Check again
    print("Missing values after fallback imputation:")
    print(f"Training data: {X_train.isnull().sum().sum()}")
    print(f"Validation data: {X_val.isnull().sum().sum()}")
    print(f"Test data: {X_test.isnull().sum().sum()}")
Missing values after imputation:
Training data: 0
Validation data: 0
Test data: 0

4.7 Handle Outliers¶

InĀ [67]:
def handle_outliers(df, y=None, dataset_name=""):
    # Track initial shape
    initial_shape = df.shape
    print(f"\nProcessing {dataset_name} dataset:")
    print(f"Initial shape: {initial_shape}")

    # Create copy and process
    df_out = df.copy()
    numeric_cols = df_out.select_dtypes(include=np.number).columns.tolist()
    
    # Only consider a row an outlier if it has outliers in multiple columns
    outlier_count_per_row = pd.Series(0, index=df_out.index)
    
    # Count outliers per column
    column_outlier_counts = {}
    
    for col in numeric_cols:
        # Skip columns with low variance or mostly identical values
        if df_out[col].std() < 0.001 or df_out[col].nunique() < 5:
            continue
            
        # Standard statistical outlier detection
        Q1 = df_out[col].quantile(0.10)  # More conservative percentiles
        Q3 = df_out[col].quantile(0.90)
        IQR = Q3 - Q1
        
        # Very conservative threshold - only mark extreme outliers
        lower = Q1 - 5*IQR
        upper = Q3 + 5*IQR
        
        # Identify outliers in this column
        col_outliers = (df_out[col] < lower) | (df_out[col] > upper)
        column_outlier_counts[col] = col_outliers.sum()
        
        # Increment outlier count for affected rows
        outlier_count_per_row += col_outliers
    
    # Only remove rows that are outliers in at least 3 different columns
    # This focuses on truly problematic data points
    outlier_rows = outlier_count_per_row >= 3
    
    # Get indices of rows to keep
    keep_indices = df_out.index[~outlier_rows]
    
    # Drop the identified outlier rows
    df_out = df_out.loc[keep_indices]
    
    # If labels are provided, filter them too
    if y is not None:
        y_out = y.loc[keep_indices]
    else:
        y_out = None
    
    # Track final shape
    final_shape = df_out.shape
    print(f"Final shape: {final_shape}")
    print(f"Rows maintained: {final_shape[0]} ({(final_shape[0]/initial_shape[0])*100:.1f}%)")
    print(f"Outlier rows removed: {initial_shape[0] - final_shape[0]}")
    
    return df_out, y_out

# Apply to datasets with names and get updated labels
X_train_out, y_train_out = handle_outliers(X_train, y_train, "Training")
X_val_out, y_val_out = handle_outliers(X_val, y_val, "Validation")
X_test_out, y_test_out = handle_outliers(X_test, y_test, "Test")
Processing Training dataset:
Initial shape: (157035, 420)
Final shape: (155305, 420)
Rows maintained: 155305 (98.9%)
Outlier rows removed: 1730

Processing Validation dataset:
Initial shape: (39259, 420)
Final shape: (38829, 420)
Rows maintained: 38829 (98.9%)
Outlier rows removed: 430

Processing Test dataset:
Initial shape: (96256, 420)
Final shape: (95180, 420)
Rows maintained: 95180 (98.9%)
Outlier rows removed: 1076

4.8 Handle Skewness¶

InĀ [68]:
def handle_skewness(df, threshold=.5):
    """Handle skewness with detailed before/after reporting"""
    numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
    skewed_features = {}
    transformation_report = []
    
    # Initial skewness analysis
    print(f"\nSkewness analysis (threshold: {threshold}):")
    for col in numeric_cols:
        skew = df[col].skew()
        if abs(skew) > threshold:
            skewed_features[col] = {
                'initial': skew,
                'final': None,
                'transformed': False
            }

    print(f"Found {len(skewed_features)} potentially skewed features")
    
    # Apply transformations
    df_transformed = df.copy()
    for col, stats in skewed_features.items():
        original_skew = stats['initial']
        
        # Check if transformation needed
        if abs(original_skew) > threshold:
            # Handle non-positive values
            if df[col].min() <= 0:
                shift = abs(df[col].min()) + 1
                transformed = np.log1p(df[col] + shift)
            else:
                transformed = np.log1p(df[col])
            
            # Calculate new skewness
            new_skew = transformed.skew()
            
            # Only apply if improvement occurs
            if abs(new_skew) < abs(original_skew):
                df_transformed[col] = transformed
                stats['final'] = new_skew
                stats['transformed'] = True
                transformation_report.append(
                    f"{col}: {original_skew:.2f} → {new_skew:.2f} (Improved)"
                )
            else:
                transformation_report.append(
                    f"{col}: {original_skew:.2f} → {new_skew:.2f} (No change)"
                )
                stats['final'] = original_skew
                stats['transformed'] = False

    # Print transformation results
    print("\nSkewness transformation results:")
    for report in transformation_report:
        print(report)
        
    return df_transformed, skewed_features

# Apply to processed datasets
print("\n=== Training Data ===")
X_train_skew, train_skew_info = handle_skewness(X_train_out, threshold=.5)

print("\n=== Validation Data ===")
X_val_skew, val_skew_info = handle_skewness(X_val_out, threshold=.5)

print("\n=== Test Data ===")
X_test_skew, test_skew_info = handle_skewness(X_test_out, threshold=.5)
=== Training Data ===

Skewness analysis (threshold: 0.5):
Found 396 potentially skewed features

Skewness transformation results:
industry_code: 0.50 → 0.23 (Improved)
occupation_code: 0.81 → 0.36 (Improved)
wage_per_hour: 9.43 → 3.93 (Improved)
capital_gains: 26.80 → 5.99 (Improved)
capital_losses: 7.51 → 6.85 (Improved)
dividends_from_stocks: 29.97 → 3.36 (Improved)
instance_weight: 1.40 → -0.79 (Improved)
own_business_or_self_employed: 2.89 → 2.85 (Improved)
veterans_benefits: -1.26 → -1.27 (No change)
work_experience: 0.77 → -0.33 (Improved)
capital_ratio: 26.80 → 5.99 (Improved)
full_year_worker: 0.54 → 0.54 (No change)
has_capital_gains: 9.71 → 9.71 (No change)
has_capital_losses: 6.83 → 6.83 (Improved)
has_dividends: 2.68 → 2.68 (No change)
is_married: -2.14 → -2.14 (Improved)
enrolled_in_edu_inst_encoded: 4.20 → 3.98 (Improved)
member_of_labor_union_encoded: 3.43 → 3.13 (Improved)
fill_inc_questionnaire_for_veteran_encoded: 11.65 → 10.83 (Improved)
class_of_worker_ Federal government: 8.02 → 8.02 (Improved)
class_of_worker_ Local government: 4.74 → 4.74 (No change)
class_of_worker_ Never worked: 20.93 → 20.93 (Improved)
class_of_worker_ Private: 0.56 → 0.56 (No change)
class_of_worker_ Self-employed-incorporated: 7.77 → 7.77 (No change)
class_of_worker_ Self-employed-not incorporated: 4.52 → 4.52 (Improved)
class_of_worker_ State government: 6.60 → 6.60 (No change)
class_of_worker_ Without pay: 34.00 → 34.00 (No change)
education_ 10th grade: 4.78 → 4.78 (No change)
education_ 11th grade: 5.05 → 5.05 (No change)
education_ 12th grade no diploma: 9.46 → 9.46 (No change)
education_ 1st 2nd 3rd or 4th grade: 10.39 → 10.39 (Improved)
education_ 5th or 6th grade: 7.48 → 7.48 (Improved)
education_ 7th and 8th grade: 4.63 → 4.63 (No change)
education_ 9th grade: 5.36 → 5.36 (No change)
education_ Associates degree-academic program: 6.48 → 6.48 (Improved)
education_ Associates degree-occup /vocational: 5.83 → 5.83 (No change)
education_ Bachelors degree(BA AB BS): 2.68 → 2.68 (No change)
education_ Children: 1.29 → 1.29 (No change)
education_ Doctorate degree(PhD EdD): 12.59 → 12.59 (No change)
education_ High school graduate: 1.17 → 1.17 (No change)
education_ Less than 1st grade: 15.24 → 15.24 (Improved)
education_ Masters degree(MA MS MEng MEd MSW MBA): 5.30 → 5.30 (Improved)
education_ Prof school degree (MD DDS DVM LLB JD): 10.77 → 10.77 (Improved)
education_ Some college but no degree: 2.06 → 2.06 (No change)
enrolled_in_edu_inst_ College or university: 5.58 → 5.58 (No change)
enrolled_in_edu_inst_ High school: 5.06 → 5.06 (No change)
enrolled_in_edu_inst_ Not in universe: -3.55 → -3.55 (Improved)
marital_status_ Divorced: 3.54 → 3.54 (Improved)
marital_status_ Married-A F spouse present: 16.84 → 16.84 (Improved)
marital_status_ Married-spouse absent: 11.29 → 11.29 (No change)
marital_status_ Separated: 7.35 → 7.35 (Improved)
marital_status_ Widowed: 4.00 → 4.00 (No change)
major_industry_code_ Agriculture: 7.90 → 7.90 (Improved)
major_industry_code_ Armed Forces: 73.16 → 73.16 (Improved)
major_industry_code_ Business and repair services: 5.66 → 5.66 (Improved)
major_industry_code_ Communications: 12.93 → 12.93 (Improved)
major_industry_code_ Construction: 5.47 → 5.47 (No change)
major_industry_code_ Education: 4.56 → 4.56 (No change)
major_industry_code_ Entertainment: 10.77 → 10.77 (No change)
major_industry_code_ Finance insurance and real estate: 5.45 → 5.45 (Improved)
major_industry_code_ Forestry and fisheries: 32.68 → 32.68 (No change)
major_industry_code_ Hospital services: 6.86 → 6.86 (No change)
major_industry_code_ Manufacturing-durable goods: 4.38 → 4.38 (Improved)
major_industry_code_ Manufacturing-nondurable goods: 5.10 → 5.10 (Improved)
major_industry_code_ Medical except hospital: 6.29 → 6.29 (Improved)
major_industry_code_ Mining: 18.95 → 18.95 (Improved)
major_industry_code_ Other professional services: 6.44 → 6.44 (No change)
major_industry_code_ Personal services except private HH: 7.99 → 7.99 (Improved)
major_industry_code_ Private household services: 14.30 → 14.30 (Improved)
major_industry_code_ Public administration: 6.29 → 6.29 (Improved)
major_industry_code_ Retail trade: 2.92 → 2.92 (No change)
major_industry_code_ Social services: 8.58 → 8.58 (Improved)
major_industry_code_ Transportation: 6.61 → 6.61 (Improved)
major_industry_code_ Utilities and sanitary services: 12.91 → 12.91 (No change)
major_industry_code_ Wholesale trade: 7.26 → 7.26 (Improved)
major_occupation_code_ Adm support including clerical: 3.21 → 3.21 (Improved)
major_occupation_code_ Armed Forces: 73.16 → 73.16 (Improved)
major_occupation_code_ Executive admin and managerial: 3.63 → 3.63 (No change)
major_occupation_code_ Farming forestry and fishing: 7.75 → 7.75 (Improved)
major_occupation_code_ Handlers equip cleaners etc : 6.61 → 6.61 (Improved)
major_occupation_code_ Machine operators assmblrs & inspctrs: 5.29 → 5.29 (No change)
major_occupation_code_ Other service: 3.63 → 3.63 (Improved)
major_occupation_code_ Precision production craft & repair: 4.00 → 4.00 (No change)
major_occupation_code_ Private household services: 15.76 → 15.76 (No change)
major_occupation_code_ Professional specialty: 3.38 → 3.38 (Improved)
major_occupation_code_ Protective services: 10.60 → 10.60 (No change)
major_occupation_code_ Sales: 3.72 → 3.72 (Improved)
major_occupation_code_ Technicians and related support: 7.96 → 7.96 (Improved)
major_occupation_code_ Transportation and material moving: 6.78 → 6.78 (No change)
race_ Amer Indian Aleut or Eskimo: 9.23 → 9.23 (Improved)
race_ Asian or Pacific Islander: 5.54 → 5.54 (No change)
race_ Black: 2.59 → 2.59 (Improved)
race_ Other: 7.07 → 7.07 (Improved)
race_ White: -1.81 → -1.81 (Improved)
hispanic_origin_ All other: -2.06 → -2.06 (Improved)
hispanic_origin_ Central or South American: 6.89 → 6.89 (Improved)
hispanic_origin_ Chicano: 24.72 → 24.72 (No change)
hispanic_origin_ Cuban: 13.17 → 13.17 (Improved)
hispanic_origin_ Do not know: 25.07 → 25.07 (Improved)
hispanic_origin_ Mexican (Mexicano): 4.89 → 4.89 (No change)
hispanic_origin_ Mexican-American: 4.60 → 4.60 (No change)
hispanic_origin_ NA: 15.14 → 15.14 (Improved)
hispanic_origin_ Other Spanish: 8.67 → 8.67 (Improved)
hispanic_origin_ Puerto Rican: 7.49 → 7.49 (Improved)
member_of_labor_union_ No: 3.10 → 3.10 (No change)
member_of_labor_union_ Not in universe: -2.77 → -2.77 (Improved)
member_of_labor_union_ Yes: 8.08 → 8.08 (Improved)
reason_for_unemployment_ Job leaver: 17.92 → 17.92 (Improved)
reason_for_unemployment_ Job loser - on layoff: 14.00 → 14.00 (Improved)
reason_for_unemployment_ New entrant: 20.93 → 20.93 (Improved)
reason_for_unemployment_ Not in universe: -5.40 → -5.40 (Improved)
reason_for_unemployment_ Other job loser: 9.59 → 9.59 (Improved)
reason_for_unemployment_ Re-entrant: 9.72 → 9.72 (No change)
full_or_part_time_employment_ Full-time schedules: 1.46 → 1.46 (No change)
full_or_part_time_employment_ Not in labor force: 2.11 → 2.11 (No change)
full_or_part_time_employment_ PT for econ reasons usually FT: 19.22 → 19.22 (No change)
full_or_part_time_employment_ PT for econ reasons usually PT: 12.76 → 12.76 (Improved)
full_or_part_time_employment_ PT for non-econ reasons usually FT: 7.51 → 7.51 (Improved)
full_or_part_time_employment_ Unemployed full-time: 9.02 → 9.02 (No change)
full_or_part_time_employment_ Unemployed part- time: 15.24 → 15.24 (Improved)
tax_filer_status_ Head of household: 4.83 → 4.83 (Improved)
tax_filer_status_ Joint both 65+: 4.59 → 4.59 (No change)
tax_filer_status_ Joint both under 65: 0.67 → 0.67 (No change)
tax_filer_status_ Joint one under 65 & one 65+: 6.94 → 6.94 (Improved)
tax_filer_status_ Nonfiler: 0.53 → 0.53 (No change)
tax_filer_status_ Single: 1.59 → 1.59 (Improved)
region_of_previous_residence_ Abroad: 19.02 → 19.02 (No change)
region_of_previous_residence_ Midwest: 7.17 → 7.17 (Improved)
region_of_previous_residence_ Northeast: 8.37 → 8.37 (Improved)
region_of_previous_residence_ Not in universe: -3.08 → -3.08 (Improved)
region_of_previous_residence_ South: 6.09 → 6.09 (Improved)
region_of_previous_residence_ West: 6.74 → 6.74 (No change)
state_of_previous_residence_ ?: 16.52 → 16.52 (No change)
state_of_previous_residence_ Abroad: 16.85 → 16.85 (No change)
state_of_previous_residence_ Alabama: 29.57 → 29.57 (No change)
state_of_previous_residence_ Alaska: 26.33 → 26.33 (No change)
state_of_previous_residence_ Arizona: 28.61 → 28.61 (No change)
state_of_previous_residence_ Arkansas: 30.63 → 30.63 (Improved)
state_of_previous_residence_ California: 10.60 → 10.60 (Improved)
state_of_previous_residence_ Colorado: 28.54 → 28.54 (Improved)
state_of_previous_residence_ Connecticut: 40.40 → 40.40 (Improved)
state_of_previous_residence_ Delaware: 53.11 → 53.11 (Improved)
state_of_previous_residence_ District of Columbia: 41.97 → 41.97 (Improved)
state_of_previous_residence_ Florida: 15.14 → 15.14 (Improved)
state_of_previous_residence_ Georgia: 28.92 → 28.92 (No change)
state_of_previous_residence_ Idaho: 82.16 → 82.16 (Improved)
state_of_previous_residence_ Illinois: 32.57 → 32.57 (No change)
state_of_previous_residence_ Indiana: 18.99 → 18.99 (No change)
state_of_previous_residence_ Iowa: 32.02 → 32.02 (Improved)
state_of_previous_residence_ Kansas: 37.20 → 37.20 (No change)
state_of_previous_residence_ Kentucky: 28.17 → 28.17 (Improved)
state_of_previous_residence_ Louisiana: 32.24 → 32.24 (Improved)
state_of_previous_residence_ Maine: 35.49 → 35.49 (No change)
state_of_previous_residence_ Maryland: 38.06 → 38.06 (No change)
state_of_previous_residence_ Massachusetts: 34.93 → 34.93 (Improved)
state_of_previous_residence_ Michigan: 20.53 → 20.53 (Improved)
state_of_previous_residence_ Minnesota: 18.33 → 18.33 (Improved)
state_of_previous_residence_ Mississippi: 30.63 → 30.63 (Improved)
state_of_previous_residence_ Missouri: 32.68 → 32.68 (No change)
state_of_previous_residence_ Montana: 28.54 → 28.54 (Improved)
state_of_previous_residence_ Nebraska: 32.68 → 32.68 (Improved)
state_of_previous_residence_ Nevada: 33.03 → 33.03 (No change)
state_of_previous_residence_ New Hampshire: 28.24 → 28.24 (Improved)
state_of_previous_residence_ New Jersey: 52.63 → 52.63 (No change)
state_of_previous_residence_ New Mexico: 21.02 → 21.02 (Improved)
state_of_previous_residence_ New York: 31.51 → 31.51 (No change)
state_of_previous_residence_ North Carolina: 15.52 → 15.52 (Improved)
state_of_previous_residence_ North Dakota: 19.68 → 19.68 (Improved)
state_of_previous_residence_ Not in universe: -3.08 → -3.08 (Improved)
state_of_previous_residence_ Ohio: 30.91 → 30.91 (No change)
state_of_previous_residence_ Oklahoma: 17.36 → 17.36 (No change)
state_of_previous_residence_ Oregon: 28.69 → 28.69 (No change)
state_of_previous_residence_ Pennsylvania: 31.71 → 31.71 (No change)
state_of_previous_residence_ South Carolina: 46.09 → 46.09 (Improved)
state_of_previous_residence_ South Dakota: 37.71 → 37.71 (Improved)
state_of_previous_residence_ Tennessee: 31.30 → 31.30 (No change)
state_of_previous_residence_ Texas: 31.11 → 31.11 (No change)
state_of_previous_residence_ Utah: 13.45 → 13.45 (Improved)
state_of_previous_residence_ Vermont: 32.57 → 32.57 (No change)
state_of_previous_residence_ Virginia: 39.98 → 39.98 (No change)
state_of_previous_residence_ West Virginia: 29.66 → 29.66 (Improved)
state_of_previous_residence_ Wisconsin: 43.75 → 43.75 (Improved)
state_of_previous_residence_ Wyoming: 28.69 → 28.69 (No change)
detailed_household_summary_ Child 18+ ever marr Not in a subfamily: 13.76 → 13.76 (Improved)
detailed_household_summary_ Child 18+ ever marr RP of subfamily: 17.21 → 17.21 (Improved)
detailed_household_summary_ Child 18+ never marr Not in a subfamily: 3.64 → 3.64 (Improved)
detailed_household_summary_ Child 18+ never marr RP of subfamily: 18.17 → 18.17 (Improved)
detailed_household_summary_ Child 18+ spouse of subfamily RP: 38.24 → 38.24 (No change)
detailed_household_summary_ Child <18 ever marr RP of subfamily: 160.88 → 160.88 (Improved)
detailed_household_summary_ Child <18 ever marr not in subfamily: 84.00 → 84.00 (Improved)
detailed_household_summary_ Child <18 never marr RP of subfamily: 48.85 → 48.85 (No change)
detailed_household_summary_ Child <18 never marr not in subfamily: 1.20 → 1.20 (No change)
detailed_household_summary_ Child <18 spouse of subfamily RP: 278.66 → 278.66 (Improved)
detailed_household_summary_ Child under 18 of RP of unrel subfamily: 16.15 → 16.15 (Improved)
detailed_household_summary_ Grandchild 18+ ever marr RP of subfamily: 139.32 → 139.32 (Improved)
detailed_household_summary_ Grandchild 18+ ever marr not in subfamily: 75.82 → 75.82 (Improved)
detailed_household_summary_ Grandchild 18+ never marr RP of subfamily: 160.88 → 160.88 (Improved)
detailed_household_summary_ Grandchild 18+ never marr not in subfamily: 22.61 → 22.61 (Improved)
detailed_household_summary_ Grandchild 18+ spouse of subfamily RP: 131.35 → 131.35 (No change)
detailed_household_summary_ Grandchild <18 never marr RP of subfamily: 394.09 → 394.09 (No change)
detailed_household_summary_ Grandchild <18 never marr child of subfamily RP: 9.99 → 9.99 (No change)
detailed_household_summary_ Grandchild <18 never marr not in subfamily: 13.58 → 13.58 (Improved)
detailed_household_summary_ Householder: 1.06 → 1.06 (No change)
detailed_household_summary_ In group quarters: 31.81 → 31.81 (No change)
detailed_household_summary_ Nonfamily householder: 2.46 → 2.46 (Improved)
detailed_household_summary_ Other Rel 18+ ever marr RP of subfamily: 17.03 → 17.03 (No change)
detailed_household_summary_ Other Rel 18+ ever marr not in subfamily: 9.84 → 9.84 (No change)
detailed_household_summary_ Other Rel 18+ never marr RP of subfamily: 46.74 → 46.74 (Improved)
detailed_household_summary_ Other Rel 18+ never marr not in subfamily: 10.45 → 10.45 (Improved)
detailed_household_summary_ Other Rel 18+ spouse of subfamily RP: 17.42 → 17.42 (No change)
detailed_household_summary_ Other Rel <18 ever marr RP of subfamily: 160.88 → 160.88 (Improved)
detailed_household_summary_ Other Rel <18 ever marr not in subfamily: 394.09 → 394.09 (No change)
detailed_household_summary_ Other Rel <18 never marr child of subfamily RP: 17.15 → 17.15 (Improved)
detailed_household_summary_ Other Rel <18 never marr not in subfamily: 17.85 → 17.85 (Improved)
detailed_household_summary_ Other Rel <18 never married RP of subfamily: 197.04 → 197.04 (No change)
detailed_household_summary_ Other Rel <18 spouse of subfamily RP: 278.66 → 278.66 (Improved)
detailed_household_summary_ RP of unrelated subfamily: 16.40 → 16.40 (Improved)
detailed_household_summary_ Secondary individual: 5.38 → 5.38 (Improved)
detailed_household_summary_ Spouse of RP of unrelated subfamily: 58.72 → 58.72 (Improved)
detailed_household_summary_ Spouse of householder: 1.39 → 1.39 (No change)
detailed_household_summary_in_household_ Child 18 or older: 3.26 → 3.26 (No change)
detailed_household_summary_in_household_ Child under 18 ever married: 71.93 → 71.93 (Improved)
detailed_household_summary_in_household_ Child under 18 never married: 1.19 → 1.19 (No change)
detailed_household_summary_in_household_ Group Quarters- Secondary individual: 38.61 → 38.61 (No change)
detailed_household_summary_in_household_ Householder: 0.50 → 0.50 (Improved)
detailed_household_summary_in_household_ Nonrelative of householder: 4.75 → 4.75 (No change)
detailed_household_summary_in_household_ Other relative of householder: 4.13 → 4.13 (Improved)
detailed_household_summary_in_household_ Spouse of householder: 1.39 → 1.39 (No change)
migration_code_change_in_msa_ Abroad to MSA: 20.67 → 20.67 (No change)
migration_code_change_in_msa_ Abroad to nonMSA: 49.62 → 49.62 (Improved)
migration_code_change_in_msa_ MSA to MSA: 3.94 → 3.94 (No change)
migration_code_change_in_msa_ MSA to nonMSA: 15.56 → 15.56 (Improved)
migration_code_change_in_msa_ NonMSA to MSA: 18.08 → 18.08 (Improved)
migration_code_change_in_msa_ NonMSA to nonMSA: 8.16 → 8.16 (Improved)
migration_code_change_in_msa_ Not identifiable: 21.33 → 21.33 (No change)
migration_code_change_in_msa_ Not in universe: 11.43 → 11.43 (Improved)
migration_code_change_in_reg_ Abroad: 19.02 → 19.02 (No change)
migration_code_change_in_reg_ Different county same state: 8.12 → 8.12 (No change)
migration_code_change_in_reg_ Different division same region: 21.12 → 21.12 (No change)
migration_code_change_in_reg_ Different region: 12.84 → 12.84 (No change)
migration_code_change_in_reg_ Different state same division: 14.00 → 14.00 (Improved)
migration_code_change_in_reg_ Not in universe: 11.43 → 11.43 (Improved)
migration_code_change_in_reg_ Same county: 4.13 → 4.13 (No change)
migration_code_move_within_reg_ Abroad: 19.02 → 19.02 (No change)
migration_code_move_within_reg_ Different county same state: 8.12 → 8.12 (No change)
migration_code_move_within_reg_ Different state in Midwest: 18.79 → 18.79 (Improved)
migration_code_move_within_reg_ Different state in Northeast: 21.33 → 21.33 (No change)
migration_code_move_within_reg_ Different state in South: 14.20 → 14.20 (Improved)
migration_code_move_within_reg_ Different state in West: 17.25 → 17.25 (No change)
migration_code_move_within_reg_ Not in universe: 11.43 → 11.43 (Improved)
migration_code_move_within_reg_ Same county: 4.13 → 4.13 (No change)
live_in_this_house_1_year_ago_ No: 3.08 → 3.08 (Improved)
migration_prev_res_in_sunbelt_ No: 4.09 → 4.09 (Improved)
migration_prev_res_in_sunbelt_ Yes: 5.55 → 5.55 (No change)
family_members_under_18_ Both parents present: 1.62 → 1.62 (Improved)
family_members_under_18_ Father only present: 10.04 → 10.04 (Improved)
family_members_under_18_ Mother only present: 3.53 → 3.53 (Improved)
family_members_under_18_ Neither parent present: 10.73 → 10.73 (Improved)
family_members_under_18_ Not in universe: -1.04 → -1.04 (Improved)
country_of_birth_father_ ?: 5.15 → 5.15 (No change)
country_of_birth_father_ Cambodia: 31.81 → 31.81 (No change)
country_of_birth_father_ Canada: 11.88 → 11.88 (Improved)
country_of_birth_father_ China: 14.90 → 14.90 (Improved)
country_of_birth_father_ Columbia: 17.83 → 17.83 (No change)
country_of_birth_father_ Cuba: 13.10 → 13.10 (Improved)
country_of_birth_father_ Dominican-Republic: 12.12 → 12.12 (Improved)
country_of_birth_father_ Ecuador: 22.84 → 22.84 (Improved)
country_of_birth_father_ El-Salvador: 13.87 → 13.87 (No change)
country_of_birth_father_ England: 16.15 → 16.15 (Improved)
country_of_birth_father_ France: 32.13 → 32.13 (Improved)
country_of_birth_father_ Germany: 11.83 → 11.83 (Improved)
country_of_birth_father_ Greece: 24.06 → 24.06 (Improved)
country_of_birth_father_ Guatemala: 21.21 → 21.21 (Improved)
country_of_birth_father_ Haiti: 23.70 → 23.70 (Improved)
country_of_birth_father_ Holand-Netherlands: 63.08 → 63.08 (Improved)
country_of_birth_father_ Honduras: 31.51 → 31.51 (No change)
country_of_birth_father_ Hong Kong: 42.71 → 42.71 (Improved)
country_of_birth_father_ Hungary: 25.98 → 25.98 (No change)
country_of_birth_father_ India: 18.48 → 18.48 (Improved)
country_of_birth_father_ Iran: 29.00 → 29.00 (No change)
country_of_birth_father_ Ireland: 19.27 → 19.27 (Improved)
country_of_birth_father_ Italy: 9.25 → 9.25 (Improved)
country_of_birth_father_ Jamaica: 20.44 → 20.44 (No change)
country_of_birth_father_ Japan: 22.39 → 22.39 (No change)
country_of_birth_father_ Laos: 36.39 → 36.39 (No change)
country_of_birth_father_ Mexico: 4.07 → 4.07 (Improved)
country_of_birth_father_ Nicaragua: 25.65 → 25.65 (No change)
country_of_birth_father_ Outlying-U S (Guam USVI etc): 33.26 → 33.26 (No change)
country_of_birth_father_ Panama: 90.39 → 90.39 (Improved)
country_of_birth_father_ Peru: 23.57 → 23.57 (Improved)
country_of_birth_father_ Philippines: 12.99 → 12.99 (Improved)
country_of_birth_father_ Poland: 12.72 → 12.72 (Improved)
country_of_birth_father_ Portugal: 22.43 → 22.43 (No change)
country_of_birth_father_ Puerto-Rico: 8.38 → 8.38 (No change)
country_of_birth_father_ Scotland: 28.39 → 28.39 (Improved)
country_of_birth_father_ South Korea: 19.11 → 19.11 (Improved)
country_of_birth_father_ Taiwan: 31.30 → 31.30 (No change)
country_of_birth_father_ Thailand: 42.22 → 42.22 (Improved)
country_of_birth_father_ Trinadad&Tobago: 40.83 → 40.83 (No change)
country_of_birth_father_ United-States: -1.46 → -1.46 (No change)
country_of_birth_father_ Vietnam: 20.61 → 20.61 (Improved)
country_of_birth_father_ Yugoslavia: 28.92 → 28.92 (No change)
country_of_birth_mother_ ?: 5.42 → 5.42 (Improved)
country_of_birth_mother_ Cambodia: 35.78 → 35.78 (No change)
country_of_birth_mother_ Canada: 11.61 → 11.61 (Improved)
country_of_birth_mother_ China: 15.78 → 15.78 (No change)
country_of_birth_mother_ Columbia: 17.81 → 17.81 (No change)
country_of_birth_mother_ Cuba: 13.19 → 13.19 (Improved)
country_of_birth_mother_ Dominican-Republic: 13.07 → 13.07 (Improved)
country_of_birth_mother_ Ecuador: 23.24 → 23.24 (Improved)
country_of_birth_mother_ El-Salvador: 13.13 → 13.13 (Improved)
country_of_birth_mother_ England: 15.04 → 15.04 (Improved)
country_of_birth_mother_ France: 30.63 → 30.63 (Improved)
country_of_birth_mother_ Germany: 11.72 → 11.72 (No change)
country_of_birth_mother_ Greece: 27.81 → 27.81 (No change)
country_of_birth_mother_ Guatemala: 21.15 → 21.15 (No change)
country_of_birth_mother_ Haiti: 23.74 → 23.74 (No change)
country_of_birth_mother_ Holand-Netherlands: 63.08 → 63.08 (Improved)
country_of_birth_mother_ Honduras: 29.16 → 29.16 (Improved)
country_of_birth_mother_ Hong Kong: 42.22 → 42.22 (No change)
country_of_birth_mother_ Hungary: 26.33 → 26.33 (No change)
country_of_birth_mother_ India: 18.39 → 18.39 (Improved)
country_of_birth_mother_ Iran: 31.61 → 31.61 (Improved)
country_of_birth_mother_ Ireland: 17.66 → 17.66 (Improved)
country_of_birth_mother_ Italy: 10.18 → 10.18 (No change)
country_of_birth_mother_ Jamaica: 20.67 → 20.67 (No change)
country_of_birth_mother_ Japan: 20.41 → 20.41 (Improved)
country_of_birth_mother_ Laos: 35.93 → 35.93 (No change)
country_of_birth_mother_ Mexico: 4.12 → 4.12 (Improved)
country_of_birth_mother_ Nicaragua: 26.22 → 26.22 (No change)
country_of_birth_mother_ Outlying-U S (Guam USVI etc): 34.00 → 34.00 (No change)
country_of_birth_mother_ Panama: 78.80 → 78.80 (Improved)
country_of_birth_mother_ Peru: 23.00 → 23.00 (No change)
country_of_birth_mother_ Philippines: 12.59 → 12.59 (Improved)
country_of_birth_mother_ Poland: 13.28 → 13.28 (No change)
country_of_birth_mother_ Portugal: 23.83 → 23.83 (Improved)
country_of_birth_mother_ Puerto-Rico: 8.74 → 8.74 (Improved)
country_of_birth_mother_ Scotland: 27.88 → 27.88 (Improved)
country_of_birth_mother_ South Korea: 17.76 → 17.76 (Improved)
country_of_birth_mother_ Taiwan: 29.40 → 29.40 (Improved)
country_of_birth_mother_ Thailand: 39.18 → 39.18 (Improved)
country_of_birth_mother_ Trinadad&Tobago: 43.49 → 43.49 (Improved)
country_of_birth_mother_ United-States: -1.51 → -1.51 (Improved)
country_of_birth_mother_ Vietnam: 20.30 → 20.30 (No change)
country_of_birth_mother_ Yugoslavia: 32.35 → 32.35 (No change)
country_of_birth_self_ ?: 7.41 → 7.41 (Improved)
country_of_birth_self_ Cambodia: 44.88 → 44.88 (Improved)
country_of_birth_self_ Canada: 16.87 → 16.87 (Improved)
country_of_birth_self_ China: 19.85 → 19.85 (No change)
country_of_birth_self_ Columbia: 21.30 → 21.30 (No change)
country_of_birth_self_ Cuba: 15.21 → 15.21 (Improved)
country_of_birth_self_ Dominican-Republic: 16.59 → 16.59 (Improved)
country_of_birth_self_ Ecuador: 27.95 → 27.95 (Improved)
country_of_birth_self_ El-Salvador: 16.85 → 16.85 (No change)
country_of_birth_self_ England: 21.43 → 21.43 (Improved)
country_of_birth_self_ France: 40.40 → 40.40 (No change)
country_of_birth_self_ Germany: 15.11 → 15.11 (No change)
country_of_birth_self_ Greece: 37.37 → 37.37 (No change)
country_of_birth_self_ Guatemala: 23.92 → 23.92 (Improved)
country_of_birth_self_ Haiti: 29.83 → 29.83 (No change)
country_of_birth_self_ Holand-Netherlands: 101.74 → 101.74 (Improved)
country_of_birth_self_ Honduras: 36.08 → 36.08 (No change)
country_of_birth_self_ Hong Kong: 43.22 → 43.22 (Improved)
country_of_birth_self_ Hungary: 50.85 → 50.85 (Improved)
country_of_birth_self_ India: 22.24 → 22.24 (Improved)
country_of_birth_self_ Iran: 35.49 → 35.49 (No change)
country_of_birth_self_ Ireland: 38.06 → 38.06 (No change)
country_of_birth_self_ Italy: 21.43 → 21.43 (Improved)
country_of_birth_self_ Jamaica: 24.52 → 24.52 (No change)
country_of_birth_self_ Japan: 24.38 → 24.38 (Improved)
country_of_birth_self_ Laos: 42.96 → 42.96 (No change)
country_of_birth_self_ Mexico: 5.54 → 5.54 (Improved)
country_of_birth_self_ Nicaragua: 31.21 → 31.21 (No change)
country_of_birth_self_ Outlying-U S (Guam USVI etc): 39.57 → 39.57 (No change)
country_of_birth_self_ Panama: 84.00 → 84.00 (No change)
country_of_birth_self_ Peru: 26.22 → 26.22 (No change)
country_of_birth_self_ Philippines: 15.34 → 15.34 (Improved)
country_of_birth_self_ Poland: 23.08 → 23.08 (Improved)
country_of_birth_self_ Portugal: 32.68 → 32.68 (No change)
country_of_birth_self_ Puerto-Rico: 11.69 → 11.69 (No change)
country_of_birth_self_ Scotland: 50.02 → 50.02 (Improved)
country_of_birth_self_ South Korea: 20.14 → 20.14 (Improved)
country_of_birth_self_ Taiwan: 30.82 → 30.82 (Improved)
country_of_birth_self_ Thailand: 41.74 → 41.74 (Improved)
country_of_birth_self_ Trinadad&Tobago: 52.63 → 52.63 (No change)
country_of_birth_self_ United-States: -2.41 → -2.41 (Improved)
country_of_birth_self_ Vietnam: 22.03 → 22.03 (Improved)
country_of_birth_self_ Yugoslavia: 52.63 → 52.63 (Improved)
citizenship_ Foreign born- Not a citizen of U S : 3.41 → 3.41 (No change)
citizenship_ Foreign born- U S citizen by naturalization: 5.54 → 5.54 (No change)
citizenship_ Native- Born abroad of American Parent(s): 10.45 → 10.45 (Improved)
citizenship_ Native- Born in Puerto Rico or U S Outlying: 11.19 → 11.19 (Improved)
citizenship_ Native- Born in the United States: -2.41 → -2.41 (Improved)
fill_inc_questionnaire_for_veteran_ No: 10.98 → 10.98 (No change)
fill_inc_questionnaire_for_veteran_ Not in universe: -9.77 → -9.77 (Improved)
fill_inc_questionnaire_for_veteran_ Yes: 21.93 → 21.93 (Improved)

=== Validation Data ===

Skewness analysis (threshold: 0.5):
Found 389 potentially skewed features

Skewness transformation results:
occupation_code: 0.80 → 0.34 (Improved)
wage_per_hour: 8.49 → 3.91 (Improved)
capital_gains: 24.14 → 6.00 (Improved)
capital_losses: 7.71 → 6.99 (Improved)
dividends_from_stocks: 29.19 → 3.33 (Improved)
instance_weight: 1.58 → -0.82 (Improved)
own_business_or_self_employed: 2.89 → 2.85 (Improved)
veterans_benefits: -1.27 → -1.28 (No change)
work_experience: 0.77 → -0.34 (Improved)
capital_ratio: 24.14 → 6.00 (Improved)
full_year_worker: 0.53 → 0.53 (No change)
has_capital_gains: 9.34 → 9.34 (Improved)
has_capital_losses: 6.98 → 6.98 (Improved)
has_dividends: 2.65 → 2.65 (No change)
is_married: -2.14 → -2.14 (Improved)
enrolled_in_edu_inst_encoded: 4.19 → 3.96 (Improved)
member_of_labor_union_encoded: 3.41 → 3.11 (Improved)
fill_inc_questionnaire_for_veteran_encoded: 12.01 → 11.16 (Improved)
class_of_worker_ Federal government: 8.30 → 8.30 (No change)
class_of_worker_ Local government: 4.69 → 4.69 (No change)
class_of_worker_ Never worked: 21.18 → 21.18 (Improved)
class_of_worker_ Private: 0.56 → 0.56 (No change)
class_of_worker_ Self-employed-incorporated: 7.52 → 7.52 (Improved)
class_of_worker_ Self-employed-not incorporated: 4.49 → 4.49 (No change)
class_of_worker_ State government: 6.59 → 6.59 (Improved)
class_of_worker_ Without pay: 35.35 → 35.35 (Improved)
education_ 10th grade: 4.83 → 4.83 (No change)
education_ 11th grade: 5.02 → 5.02 (No change)
education_ 12th grade no diploma: 9.26 → 9.26 (Improved)
education_ 1st 2nd 3rd or 4th grade: 9.73 → 9.73 (No change)
education_ 5th or 6th grade: 7.63 → 7.63 (Improved)
education_ 7th and 8th grade: 4.65 → 4.65 (No change)
education_ 9th grade: 5.24 → 5.24 (Improved)
education_ Associates degree-academic program: 6.52 → 6.52 (No change)
education_ Associates degree-occup /vocational: 5.74 → 5.74 (No change)
education_ Bachelors degree(BA AB BS): 2.69 → 2.69 (No change)
education_ Children: 1.30 → 1.30 (Improved)
education_ Doctorate degree(PhD EdD): 13.05 → 13.05 (Improved)
education_ High school graduate: 1.18 → 1.18 (No change)
education_ Less than 1st grade: 15.63 → 15.63 (Improved)
education_ Masters degree(MA MS MEng MEd MSW MBA): 5.26 → 5.26 (Improved)
education_ Prof school degree (MD DDS DVM LLB JD): 10.69 → 10.69 (No change)
education_ Some college but no degree: 2.05 → 2.05 (No change)
enrolled_in_edu_inst_ College or university: 5.64 → 5.64 (No change)
enrolled_in_edu_inst_ High school: 4.96 → 4.96 (No change)
enrolled_in_edu_inst_ Not in universe: -3.53 → -3.53 (Improved)
marital_status_ Divorced: 3.59 → 3.59 (No change)
marital_status_ Married-A F spouse present: 17.83 → 17.83 (No change)
marital_status_ Married-spouse absent: 11.13 → 11.13 (Improved)
marital_status_ Separated: 7.21 → 7.21 (Improved)
marital_status_ Widowed: 3.99 → 3.99 (No change)
major_industry_code_ Agriculture: 7.69 → 7.69 (Improved)
major_industry_code_ Armed Forces: 80.43 → 80.43 (Improved)
major_industry_code_ Business and repair services: 5.63 → 5.63 (Improved)
major_industry_code_ Communications: 13.33 → 13.33 (Improved)
major_industry_code_ Construction: 5.46 → 5.46 (No change)
major_industry_code_ Education: 4.56 → 4.56 (No change)
major_industry_code_ Entertainment: 10.79 → 10.79 (Improved)
major_industry_code_ Finance insurance and real estate: 5.34 → 5.34 (Improved)
major_industry_code_ Forestry and fisheries: 32.80 → 32.80 (No change)
major_industry_code_ Hospital services: 6.75 → 6.75 (Improved)
major_industry_code_ Manufacturing-durable goods: 4.35 → 4.35 (No change)
major_industry_code_ Manufacturing-nondurable goods: 4.97 → 4.97 (No change)
major_industry_code_ Medical except hospital: 6.18 → 6.18 (No change)
major_industry_code_ Mining: 18.06 → 18.06 (Improved)
major_industry_code_ Other professional services: 6.54 → 6.54 (No change)
major_industry_code_ Personal services except private HH: 7.92 → 7.92 (Improved)
major_industry_code_ Private household services: 14.08 → 14.08 (Improved)
major_industry_code_ Public administration: 6.49 → 6.49 (No change)
major_industry_code_ Retail trade: 2.96 → 2.96 (Improved)
major_industry_code_ Social services: 8.69 → 8.69 (Improved)
major_industry_code_ Transportation: 6.73 → 6.73 (No change)
major_industry_code_ Utilities and sanitary services: 12.82 → 12.82 (No change)
major_industry_code_ Wholesale trade: 7.05 → 7.05 (Improved)
major_occupation_code_ Adm support including clerical: 3.19 → 3.19 (No change)
major_occupation_code_ Armed Forces: 80.43 → 80.43 (Improved)
major_occupation_code_ Executive admin and managerial: 3.63 → 3.63 (No change)
major_occupation_code_ Farming forestry and fishing: 7.50 → 7.50 (Improved)
major_occupation_code_ Handlers equip cleaners etc : 6.90 → 6.90 (Improved)
major_occupation_code_ Machine operators assmblrs & inspctrs: 5.19 → 5.19 (Improved)
major_occupation_code_ Other service: 3.64 → 3.64 (No change)
major_occupation_code_ Precision production craft & repair: 3.88 → 3.88 (Improved)
major_occupation_code_ Private household services: 15.48 → 15.48 (Improved)
major_occupation_code_ Professional specialty: 3.39 → 3.39 (No change)
major_occupation_code_ Protective services: 11.50 → 11.50 (No change)
major_occupation_code_ Sales: 3.72 → 3.72 (Improved)
major_occupation_code_ Technicians and related support: 7.69 → 7.69 (No change)
major_occupation_code_ Transportation and material moving: 6.81 → 6.81 (Improved)
race_ Amer Indian Aleut or Eskimo: 8.90 → 8.90 (No change)
race_ Asian or Pacific Islander: 5.54 → 5.54 (Improved)
race_ Black: 2.65 → 2.65 (No change)
race_ Other: 7.22 → 7.22 (No change)
race_ White: -1.85 → -1.85 (Improved)
hispanic_origin_ All other: -2.06 → -2.06 (Improved)
hispanic_origin_ Central or South American: 6.73 → 6.73 (Improved)
hispanic_origin_ Chicano: 28.39 → 28.39 (Improved)
hispanic_origin_ Cuban: 12.68 → 12.68 (Improved)
hispanic_origin_ Do not know: 26.28 → 26.28 (Improved)
hispanic_origin_ Mexican (Mexicano): 4.92 → 4.92 (No change)
hispanic_origin_ Mexican-American: 4.69 → 4.69 (Improved)
hispanic_origin_ NA: 14.23 → 14.23 (Improved)
hispanic_origin_ Other Spanish: 8.83 → 8.83 (No change)
hispanic_origin_ Puerto Rican: 7.39 → 7.39 (No change)
member_of_labor_union_ No: 3.08 → 3.08 (Improved)
member_of_labor_union_ Not in universe: -2.76 → -2.76 (Improved)
member_of_labor_union_ Yes: 8.06 → 8.06 (Improved)
reason_for_unemployment_ Job leaver: 18.29 → 18.29 (Improved)
reason_for_unemployment_ Job loser - on layoff: 14.01 → 14.01 (Improved)
reason_for_unemployment_ New entrant: 21.18 → 21.18 (Improved)
reason_for_unemployment_ Not in universe: -5.39 → -5.39 (Improved)
reason_for_unemployment_ Other job loser: 9.79 → 9.79 (Improved)
reason_for_unemployment_ Re-entrant: 9.41 → 9.41 (Improved)
full_or_part_time_employment_ Full-time schedules: 1.44 → 1.44 (No change)
full_or_part_time_employment_ Not in labor force: 2.13 → 2.13 (No change)
full_or_part_time_employment_ PT for econ reasons usually FT: 19.25 → 19.25 (Improved)
full_or_part_time_employment_ PT for econ reasons usually PT: 12.42 → 12.42 (Improved)
full_or_part_time_employment_ PT for non-econ reasons usually FT: 7.61 → 7.61 (Improved)
full_or_part_time_employment_ Unemployed full-time: 9.03 → 9.03 (Improved)
full_or_part_time_employment_ Unemployed part- time: 14.75 → 14.75 (Improved)
tax_filer_status_ Head of household: 4.95 → 4.95 (No change)
tax_filer_status_ Joint both 65+: 4.58 → 4.58 (No change)
tax_filer_status_ Joint both under 65: 0.65 → 0.65 (No change)
tax_filer_status_ Joint one under 65 & one 65+: 7.20 → 7.20 (Improved)
tax_filer_status_ Nonfiler: 0.55 → 0.55 (No change)
tax_filer_status_ Single: 1.57 → 1.57 (Improved)
region_of_previous_residence_ Abroad: 19.53 → 19.53 (No change)
region_of_previous_residence_ Midwest: 7.40 → 7.40 (No change)
region_of_previous_residence_ Northeast: 8.24 → 8.24 (No change)
region_of_previous_residence_ Not in universe: -3.08 → -3.08 (Improved)
region_of_previous_residence_ South: 6.08 → 6.08 (No change)
region_of_previous_residence_ West: 6.58 → 6.58 (No change)
state_of_previous_residence_ ?: 16.75 → 16.75 (Improved)
state_of_previous_residence_ Abroad: 17.47 → 17.47 (No change)
state_of_previous_residence_ Alabama: 31.92 → 31.92 (No change)
state_of_previous_residence_ Alaska: 24.38 → 24.38 (Improved)
state_of_previous_residence_ Arizona: 27.01 → 27.01 (No change)
state_of_previous_residence_ Arkansas: 31.92 → 31.92 (No change)
state_of_previous_residence_ California: 10.34 → 10.34 (Improved)
state_of_previous_residence_ Colorado: 28.39 → 28.39 (Improved)
state_of_previous_residence_ Connecticut: 42.97 → 42.97 (Improved)
state_of_previous_residence_ Delaware: 46.41 → 46.41 (Improved)
state_of_previous_residence_ District of Columbia: 40.19 → 40.19 (No change)
state_of_previous_residence_ Florida: 14.93 → 14.93 (Improved)
state_of_previous_residence_ Georgia: 30.36 → 30.36 (Improved)
state_of_previous_residence_ Idaho: 80.43 → 80.43 (Improved)
state_of_previous_residence_ Illinois: 34.26 → 34.26 (No change)
state_of_previous_residence_ Indiana: 19.63 → 19.63 (No change)
state_of_previous_residence_ Iowa: 31.92 → 31.92 (No change)
state_of_previous_residence_ Kansas: 32.80 → 32.80 (No change)
state_of_previous_residence_ Kentucky: 29.00 → 29.00 (Improved)
state_of_previous_residence_ Louisiana: 30.00 → 30.00 (No change)
state_of_previous_residence_ Maine: 30.00 → 30.00 (No change)
state_of_previous_residence_ Maryland: 38.61 → 38.61 (Improved)
state_of_previous_residence_ Massachusetts: 40.19 → 40.19 (Improved)
state_of_previous_residence_ Michigan: 23.16 → 23.16 (Improved)
state_of_previous_residence_ Minnesota: 18.88 → 18.88 (Improved)
state_of_previous_residence_ Mississippi: 32.80 → 32.80 (No change)
state_of_previous_residence_ Missouri: 37.20 → 37.20 (Improved)
state_of_previous_residence_ Montana: 33.75 → 33.75 (No change)
state_of_previous_residence_ Nebraska: 35.94 → 35.94 (No change)
state_of_previous_residence_ Nevada: 34.79 → 34.79 (No change)
state_of_previous_residence_ New Hampshire: 29.32 → 29.32 (No change)
state_of_previous_residence_ New Jersey: 46.41 → 46.41 (Improved)
state_of_previous_residence_ New Mexico: 18.88 → 18.88 (Improved)
state_of_previous_residence_ New York: 31.51 → 31.51 (Improved)
state_of_previous_residence_ North Carolina: 15.20 → 15.20 (No change)
state_of_previous_residence_ North Dakota: 20.04 → 20.04 (Improved)
state_of_previous_residence_ Not in universe: -3.08 → -3.08 (Improved)
state_of_previous_residence_ Ohio: 28.69 → 28.69 (Improved)
state_of_previous_residence_ Oklahoma: 19.06 → 19.06 (No change)
state_of_previous_residence_ Oregon: 29.00 → 29.00 (Improved)
state_of_previous_residence_ Pennsylvania: 29.66 → 29.66 (No change)
state_of_previous_residence_ South Carolina: 41.98 → 41.98 (No change)
state_of_previous_residence_ South Dakota: 36.55 → 36.55 (No change)
state_of_previous_residence_ Tennessee: 30.36 → 30.36 (No change)
state_of_previous_residence_ Texas: 29.32 → 29.32 (No change)
state_of_previous_residence_ Utah: 13.39 → 13.39 (No change)
state_of_previous_residence_ Vermont: 29.66 → 29.66 (No change)
state_of_previous_residence_ Virginia: 37.20 → 37.20 (Improved)
state_of_previous_residence_ West Virginia: 26.51 → 26.51 (No change)
state_of_previous_residence_ Wisconsin: 40.19 → 40.19 (No change)
state_of_previous_residence_ Wyoming: 28.69 → 28.69 (Improved)
detailed_household_summary_ Child 18+ ever marr Not in a subfamily: 13.97 → 13.97 (No change)
detailed_household_summary_ Child 18+ ever marr RP of subfamily: 16.11 → 16.11 (Improved)
detailed_household_summary_ Child 18+ never marr Not in a subfamily: 3.64 → 3.64 (No change)
detailed_household_summary_ Child 18+ never marr RP of subfamily: 17.76 → 17.76 (Improved)
detailed_household_summary_ Child 18+ spouse of subfamily RP: 44.03 → 44.03 (No change)
detailed_household_summary_ Child <18 ever marr RP of subfamily: 113.76 → 113.76 (No change)
detailed_household_summary_ Child <18 ever marr not in subfamily: 52.64 → 52.64 (Improved)
detailed_household_summary_ Child <18 never marr RP of subfamily: 50.85 → 50.85 (Improved)
detailed_household_summary_ Child <18 never marr not in subfamily: 1.19 → 1.19 (No change)
detailed_household_summary_ Child under 18 of RP of unrel subfamily: 16.56 → 16.56 (Improved)
detailed_household_summary_ Grandchild 18+ ever marr RP of subfamily: 197.05 → 197.05 (No change)
detailed_household_summary_ Grandchild 18+ ever marr not in subfamily: 74.46 → 74.46 (Improved)
detailed_household_summary_ Grandchild 18+ never marr not in subfamily: 23.00 → 23.00 (No change)
detailed_household_summary_ Grandchild 18+ spouse of subfamily RP: 197.05 → 197.05 (No change)
detailed_household_summary_ Grandchild <18 never marr RP of subfamily: 197.05 → 197.05 (No change)
detailed_household_summary_ Grandchild <18 never marr child of subfamily RP: 10.74 → 10.74 (Improved)
detailed_household_summary_ Grandchild <18 never marr not in subfamily: 12.99 → 12.99 (No change)
detailed_household_summary_ Householder: 1.05 → 1.05 (No change)
detailed_household_summary_ In group quarters: 31.11 → 31.11 (Improved)
detailed_household_summary_ Nonfamily householder: 2.48 → 2.48 (No change)
detailed_household_summary_ Other Rel 18+ ever marr RP of subfamily: 18.14 → 18.14 (Improved)
detailed_household_summary_ Other Rel 18+ ever marr not in subfamily: 9.78 → 9.78 (Improved)
detailed_household_summary_ Other Rel 18+ never marr RP of subfamily: 41.05 → 41.05 (No change)
detailed_household_summary_ Other Rel 18+ never marr not in subfamily: 10.61 → 10.61 (Improved)
detailed_household_summary_ Other Rel 18+ spouse of subfamily RP: 17.13 → 17.13 (No change)
detailed_household_summary_ Other Rel <18 never marr child of subfamily RP: 17.20 → 17.20 (No change)
detailed_household_summary_ Other Rel <18 never marr not in subfamily: 19.93 → 19.93 (Improved)
detailed_household_summary_ Other Rel <18 spouse of subfamily RP: 197.05 → 197.05 (No change)
detailed_household_summary_ RP of unrelated subfamily: 18.54 → 18.54 (Improved)
detailed_household_summary_ Secondary individual: 5.45 → 5.45 (No change)
detailed_household_summary_ Spouse of RP of unrelated subfamily: 74.46 → 74.46 (Improved)
detailed_household_summary_ Spouse of householder: 1.38 → 1.38 (No change)
detailed_household_summary_in_household_ Child 18 or older: 3.25 → 3.25 (Improved)
detailed_household_summary_in_household_ Child under 18 ever married: 47.76 → 47.76 (Improved)
detailed_household_summary_in_household_ Child under 18 never married: 1.19 → 1.19 (Improved)
detailed_household_summary_in_household_ Group Quarters- Secondary individual: 38.61 → 38.61 (Improved)
detailed_household_summary_in_household_ Nonrelative of householder: 4.88 → 4.88 (No change)
detailed_household_summary_in_household_ Other relative of householder: 4.22 → 4.22 (No change)
detailed_household_summary_in_household_ Spouse of householder: 1.38 → 1.38 (No change)
migration_code_change_in_msa_ Abroad to MSA: 20.82 → 20.82 (No change)
migration_code_change_in_msa_ Abroad to nonMSA: 62.29 → 62.29 (Improved)
migration_code_change_in_msa_ MSA to MSA: 3.94 → 3.94 (Improved)
migration_code_change_in_msa_ MSA to nonMSA: 16.11 → 16.11 (No change)
migration_code_change_in_msa_ NonMSA to MSA: 16.50 → 16.50 (Improved)
migration_code_change_in_msa_ NonMSA to nonMSA: 8.15 → 8.15 (No change)
migration_code_change_in_msa_ Not identifiable: 20.94 → 20.94 (No change)
migration_code_change_in_msa_ Not in universe: 12.17 → 12.17 (No change)
migration_code_change_in_reg_ Abroad: 19.53 → 19.53 (No change)
migration_code_change_in_reg_ Different county same state: 8.45 → 8.45 (No change)
migration_code_change_in_reg_ Different division same region: 18.37 → 18.37 (Improved)
migration_code_change_in_reg_ Different region: 12.68 → 12.68 (Improved)
migration_code_change_in_reg_ Different state same division: 13.76 → 13.76 (Improved)
migration_code_change_in_reg_ Not in universe: 12.17 → 12.17 (No change)
migration_code_change_in_reg_ Same county: 4.11 → 4.11 (No change)
migration_code_move_within_reg_ Abroad: 19.53 → 19.53 (No change)
migration_code_move_within_reg_ Different county same state: 8.45 → 8.45 (No change)
migration_code_move_within_reg_ Different state in Midwest: 19.15 → 19.15 (No change)
migration_code_move_within_reg_ Different state in Northeast: 21.06 → 21.06 (No change)
migration_code_move_within_reg_ Different state in South: 13.62 → 13.62 (Improved)
migration_code_move_within_reg_ Different state in West: 15.73 → 15.73 (Improved)
migration_code_move_within_reg_ Not in universe: 12.17 → 12.17 (No change)
migration_code_move_within_reg_ Same county: 4.11 → 4.11 (No change)
live_in_this_house_1_year_ago_ No: 3.08 → 3.08 (No change)
migration_prev_res_in_sunbelt_ No: 4.08 → 4.08 (Improved)
migration_prev_res_in_sunbelt_ Yes: 5.54 → 5.54 (Improved)
family_members_under_18_ Both parents present: 1.60 → 1.60 (Improved)
family_members_under_18_ Father only present: 10.01 → 10.01 (Improved)
family_members_under_18_ Mother only present: 3.61 → 3.61 (Improved)
family_members_under_18_ Neither parent present: 10.83 → 10.83 (Improved)
family_members_under_18_ Not in universe: -1.05 → -1.05 (Improved)
country_of_birth_father_ ?: 5.12 → 5.12 (No change)
country_of_birth_father_ Cambodia: 30.36 → 30.36 (No change)
country_of_birth_father_ Canada: 11.71 → 11.71 (Improved)
country_of_birth_father_ China: 15.63 → 15.63 (Improved)
country_of_birth_father_ Columbia: 17.26 → 17.26 (Improved)
country_of_birth_father_ Cuba: 12.99 → 12.99 (Improved)
country_of_birth_father_ Dominican-Republic: 12.44 → 12.44 (Improved)
country_of_birth_father_ Ecuador: 21.83 → 21.83 (Improved)
country_of_birth_father_ El-Salvador: 14.34 → 14.34 (Improved)
country_of_birth_father_ England: 14.15 → 14.15 (Improved)
country_of_birth_father_ France: 31.11 → 31.11 (Improved)
country_of_birth_father_ Germany: 12.63 → 12.63 (Improved)
country_of_birth_father_ Greece: 22.84 → 22.84 (Improved)
country_of_birth_father_ Guatemala: 19.83 → 19.83 (Improved)
country_of_birth_father_ Haiti: 22.84 → 22.84 (Improved)
country_of_birth_father_ Holand-Netherlands: 59.39 → 59.39 (No change)
country_of_birth_father_ Honduras: 31.92 → 31.92 (No change)
country_of_birth_father_ Hong Kong: 44.03 → 44.03 (No change)
country_of_birth_father_ Hungary: 24.01 → 24.01 (No change)
country_of_birth_father_ India: 17.76 → 17.76 (Improved)
country_of_birth_father_ Iran: 28.10 → 28.10 (No change)
country_of_birth_father_ Ireland: 21.43 → 21.43 (Improved)
country_of_birth_father_ Italy: 9.41 → 9.41 (Improved)
country_of_birth_father_ Jamaica: 20.82 → 20.82 (No change)
country_of_birth_father_ Japan: 22.84 → 22.84 (Improved)
country_of_birth_father_ Laos: 32.35 → 32.35 (Improved)
country_of_birth_father_ Mexico: 4.10 → 4.10 (No change)
country_of_birth_father_ Nicaragua: 22.10 → 22.10 (Improved)
country_of_birth_father_ Outlying-U S (Guam USVI etc): 45.18 → 45.18 (Improved)
country_of_birth_father_ Panama: 80.43 → 80.43 (Improved)
country_of_birth_father_ Peru: 26.28 → 26.28 (No change)
country_of_birth_father_ Philippines: 12.63 → 12.63 (Improved)
country_of_birth_father_ Poland: 12.71 → 12.71 (Improved)
country_of_birth_father_ Portugal: 22.25 → 22.25 (Improved)
country_of_birth_father_ Puerto-Rico: 8.22 → 8.22 (Improved)
country_of_birth_father_ Scotland: 28.69 → 28.69 (Improved)
country_of_birth_father_ South Korea: 19.43 → 19.43 (Improved)
country_of_birth_father_ Taiwan: 31.92 → 31.92 (Improved)
country_of_birth_father_ Thailand: 44.03 → 44.03 (No change)
country_of_birth_father_ Trinadad&Tobago: 45.18 → 45.18 (Improved)
country_of_birth_father_ United-States: -1.46 → -1.46 (No change)
country_of_birth_father_ Vietnam: 20.36 → 20.36 (Improved)
country_of_birth_father_ Yugoslavia: 35.35 → 35.35 (Improved)
country_of_birth_mother_ ?: 5.41 → 5.41 (No change)
country_of_birth_mother_ Cambodia: 33.75 → 33.75 (No change)
country_of_birth_mother_ Canada: 11.34 → 11.34 (Improved)
country_of_birth_mother_ China: 16.75 → 16.75 (Improved)
country_of_birth_mother_ Columbia: 17.54 → 17.54 (Improved)
country_of_birth_mother_ Cuba: 13.17 → 13.17 (Improved)
country_of_birth_mother_ Dominican-Republic: 13.72 → 13.72 (Improved)
country_of_birth_mother_ Ecuador: 20.94 → 20.94 (Improved)
country_of_birth_mother_ El-Salvador: 13.20 → 13.20 (Improved)
country_of_birth_mother_ England: 13.72 → 13.72 (Improved)
country_of_birth_mother_ France: 29.32 → 29.32 (No change)
country_of_birth_mother_ Germany: 12.50 → 12.50 (Improved)
country_of_birth_mother_ Greece: 25.60 → 25.60 (No change)
country_of_birth_mother_ Guatemala: 20.04 → 20.04 (No change)
country_of_birth_mother_ Haiti: 22.69 → 22.69 (No change)
country_of_birth_mother_ Holand-Netherlands: 65.66 → 65.66 (No change)
country_of_birth_mother_ Honduras: 32.80 → 32.80 (No change)
country_of_birth_mother_ Hong Kong: 44.03 → 44.03 (No change)
country_of_birth_mother_ Hungary: 24.38 → 24.38 (Improved)
country_of_birth_mother_ India: 17.91 → 17.91 (Improved)
country_of_birth_mother_ Iran: 30.00 → 30.00 (No change)
country_of_birth_mother_ Ireland: 20.25 → 20.25 (No change)
country_of_birth_mother_ Italy: 10.17 → 10.17 (Improved)
country_of_birth_mother_ Jamaica: 21.06 → 21.06 (No change)
country_of_birth_mother_ Japan: 20.94 → 20.94 (Improved)
country_of_birth_mother_ Laos: 33.26 → 33.26 (No change)
country_of_birth_mother_ Mexico: 4.16 → 4.16 (No change)
country_of_birth_mother_ Nicaragua: 22.69 → 22.69 (No change)
country_of_birth_mother_ Outlying-U S (Guam USVI etc): 41.05 → 41.05 (No change)
country_of_birth_mother_ Panama: 74.46 → 74.46 (Improved)
country_of_birth_mother_ Peru: 24.97 → 24.97 (Improved)
country_of_birth_mother_ Philippines: 12.17 → 12.17 (No change)
country_of_birth_mother_ Poland: 13.17 → 13.17 (Improved)
country_of_birth_mother_ Portugal: 24.01 → 24.01 (Improved)
country_of_birth_mother_ Puerto-Rico: 8.57 → 8.57 (No change)
country_of_birth_mother_ Scotland: 31.11 → 31.11 (Improved)
country_of_birth_mother_ South Korea: 18.46 → 18.46 (Improved)
country_of_birth_mother_ Taiwan: 31.11 → 31.11 (Improved)
country_of_birth_mother_ Thailand: 41.98 → 41.98 (No change)
country_of_birth_mother_ Trinadad&Tobago: 49.23 → 49.23 (No change)
country_of_birth_mother_ United-States: -1.51 → -1.51 (Improved)
country_of_birth_mother_ Vietnam: 19.83 → 19.83 (Improved)
country_of_birth_mother_ Yugoslavia: 36.55 → 36.55 (No change)
country_of_birth_self_ ?: 7.47 → 7.47 (Improved)
country_of_birth_self_ Cambodia: 47.76 → 47.76 (Improved)
country_of_birth_self_ Canada: 16.22 → 16.22 (No change)
country_of_birth_self_ China: 21.56 → 21.56 (Improved)
country_of_birth_self_ Columbia: 20.36 → 20.36 (Improved)
country_of_birth_self_ Cuba: 15.06 → 15.06 (Improved)
country_of_birth_self_ Dominican-Republic: 17.33 → 17.33 (No change)
country_of_birth_self_ Ecuador: 25.60 → 25.60 (Improved)
country_of_birth_self_ El-Salvador: 16.22 → 16.22 (Improved)
country_of_birth_self_ England: 18.46 → 18.46 (No change)
country_of_birth_self_ France: 40.19 → 40.19 (No change)
country_of_birth_self_ Germany: 15.29 → 15.29 (No change)
country_of_birth_self_ Greece: 33.75 → 33.75 (No change)
country_of_birth_self_ Guatemala: 23.16 → 23.16 (Improved)
country_of_birth_self_ Haiti: 27.54 → 27.54 (Improved)
country_of_birth_self_ Holand-Netherlands: 80.43 → 80.43 (Improved)
country_of_birth_self_ Honduras: 39.37 → 39.37 (No change)
country_of_birth_self_ Hong Kong: 47.76 → 47.76 (Improved)
country_of_birth_self_ Hungary: 50.85 → 50.85 (Improved)
country_of_birth_self_ India: 20.70 → 20.70 (No change)
country_of_birth_self_ Iran: 33.75 → 33.75 (No change)
country_of_birth_self_ Ireland: 37.88 → 37.88 (Improved)
country_of_birth_self_ Italy: 21.96 → 21.96 (No change)
country_of_birth_self_ Jamaica: 25.38 → 25.38 (No change)
country_of_birth_self_ Japan: 23.16 → 23.16 (Improved)
country_of_birth_self_ Laos: 42.97 → 42.97 (Improved)
country_of_birth_self_ Mexico: 5.60 → 5.60 (No change)
country_of_birth_self_ Nicaragua: 25.60 → 25.60 (Improved)
country_of_birth_self_ Outlying-U S (Guam USVI etc): 44.03 → 44.03 (No change)
country_of_birth_self_ Panama: 80.43 → 80.43 (Improved)
country_of_birth_self_ Peru: 30.36 → 30.36 (No change)
country_of_birth_self_ Philippines: 14.38 → 14.38 (Improved)
country_of_birth_self_ Poland: 21.43 → 21.43 (Improved)
country_of_birth_self_ Portugal: 36.55 → 36.55 (No change)
country_of_birth_self_ Puerto-Rico: 11.52 → 11.52 (No change)
country_of_birth_self_ Scotland: 54.63 → 54.63 (Improved)
country_of_birth_self_ South Korea: 21.18 → 21.18 (Improved)
country_of_birth_self_ Taiwan: 33.75 → 33.75 (No change)
country_of_birth_self_ Thailand: 41.05 → 41.05 (No change)
country_of_birth_self_ Trinadad&Tobago: 62.29 → 62.29 (Improved)
country_of_birth_self_ United-States: -2.40 → -2.40 (Improved)
country_of_birth_self_ Vietnam: 23.16 → 23.16 (Improved)
country_of_birth_self_ Yugoslavia: 62.29 → 62.29 (Improved)
citizenship_ Foreign born- Not a citizen of U S : 3.39 → 3.39 (Improved)
citizenship_ Foreign born- U S citizen by naturalization: 5.54 → 5.54 (Improved)
citizenship_ Native- Born abroad of American Parent(s): 10.42 → 10.42 (Improved)
citizenship_ Native- Born in Puerto Rico or U S Outlying: 11.13 → 11.13 (Improved)
citizenship_ Native- Born in the United States: -2.40 → -2.40 (Improved)
fill_inc_questionnaire_for_veteran_ No: 11.24 → 11.24 (No change)
fill_inc_questionnaire_for_veteran_ Not in universe: -10.10 → -10.10 (Improved)
fill_inc_questionnaire_for_veteran_ Yes: 23.49 → 23.49 (No change)

=== Test Data ===

Skewness analysis (threshold: 0.5):
Found 394 potentially skewed features

Skewness transformation results:
occupation_code: 0.77 → 0.32 (Improved)
wage_per_hour: 8.88 → 3.91 (Improved)
capital_gains: 26.15 → 5.89 (Improved)
capital_losses: 7.64 → 6.90 (Improved)
dividends_from_stocks: 27.41 → 3.32 (Improved)
instance_weight: 1.45 → -0.79 (Improved)
own_business_or_self_employed: 2.82 → 2.78 (Improved)
veterans_benefits: -1.38 → -1.39 (No change)
work_experience: 0.74 → -0.39 (Improved)
capital_ratio: 26.15 → 5.89 (Improved)
full_year_worker: 0.51 → 0.51 (No change)
has_capital_gains: 9.56 → 9.56 (Improved)
has_capital_losses: 6.88 → 6.88 (Improved)
has_dividends: 2.64 → 2.64 (Improved)
is_married: -2.08 → -2.08 (Improved)
enrolled_in_edu_inst_encoded: 4.15 → 3.92 (Improved)
member_of_labor_union_encoded: 3.39 → 3.09 (Improved)
fill_inc_questionnaire_for_veteran_encoded: 11.41 → 10.61 (Improved)
class_of_worker_ Federal government: 8.15 → 8.15 (Improved)
class_of_worker_ Local government: 4.72 → 4.72 (No change)
class_of_worker_ Never worked: 21.53 → 21.53 (Improved)
class_of_worker_ Private: 0.53 → 0.53 (Improved)
class_of_worker_ Self-employed-incorporated: 7.57 → 7.57 (No change)
class_of_worker_ Self-employed-not incorporated: 4.42 → 4.42 (No change)
class_of_worker_ State government: 6.47 → 6.47 (No change)
class_of_worker_ Without pay: 35.58 → 35.58 (Improved)
education_ 10th grade: 4.72 → 4.72 (Improved)
education_ 11th grade: 4.92 → 4.92 (No change)
education_ 12th grade no diploma: 9.02 → 9.02 (Improved)
education_ 1st 2nd 3rd or 4th grade: 10.14 → 10.14 (Improved)
education_ 5th or 6th grade: 7.27 → 7.27 (No change)
education_ 7th and 8th grade: 4.51 → 4.51 (No change)
education_ 9th grade: 5.30 → 5.30 (Improved)
education_ Associates degree-academic program: 6.59 → 6.59 (No change)
education_ Associates degree-occup /vocational: 5.74 → 5.74 (No change)
education_ Bachelors degree(BA AB BS): 2.66 → 2.66 (No change)
education_ Children: 1.41 → 1.41 (No change)
education_ Doctorate degree(PhD EdD): 12.73 → 12.73 (Improved)
education_ High school graduate: 1.15 → 1.15 (No change)
education_ Less than 1st grade: 15.10 → 15.10 (Improved)
education_ Masters degree(MA MS MEng MEd MSW MBA): 5.19 → 5.19 (No change)
education_ Prof school degree (MD DDS DVM LLB JD): 10.64 → 10.64 (No change)
education_ Some college but no degree: 2.02 → 2.02 (No change)
enrolled_in_edu_inst_ College or university: 5.55 → 5.55 (Improved)
enrolled_in_edu_inst_ High school: 4.96 → 4.96 (No change)
enrolled_in_edu_inst_ Not in universe: -3.50 → -3.50 (Improved)
marital_status_ Divorced: 3.47 → 3.47 (Improved)
marital_status_ Married-A F spouse present: 16.74 → 16.74 (Improved)
marital_status_ Married-spouse absent: 11.45 → 11.45 (Improved)
marital_status_ Separated: 7.33 → 7.33 (No change)
marital_status_ Widowed: 3.91 → 3.91 (Improved)
major_industry_code_ Agriculture: 7.93 → 7.93 (No change)
major_industry_code_ Armed Forces: 79.64 → 79.64 (Improved)
major_industry_code_ Business and repair services: 5.43 → 5.43 (No change)
major_industry_code_ Communications: 12.78 → 12.78 (Improved)
major_industry_code_ Construction: 5.33 → 5.33 (Improved)
major_industry_code_ Education: 4.46 → 4.46 (No change)
major_industry_code_ Entertainment: 10.71 → 10.71 (No change)
major_industry_code_ Finance insurance and real estate: 5.43 → 5.43 (No change)
major_industry_code_ Forestry and fisheries: 30.96 → 30.96 (Improved)
major_industry_code_ Hospital services: 7.04 → 7.04 (Improved)
major_industry_code_ Manufacturing-durable goods: 4.36 → 4.36 (Improved)
major_industry_code_ Manufacturing-nondurable goods: 5.06 → 5.06 (No change)
major_industry_code_ Medical except hospital: 6.25 → 6.25 (No change)
major_industry_code_ Mining: 17.19 → 17.19 (No change)
major_industry_code_ Other professional services: 6.46 → 6.46 (Improved)
major_industry_code_ Personal services except private HH: 7.94 → 7.94 (No change)
major_industry_code_ Private household services: 13.79 → 13.79 (No change)
major_industry_code_ Public administration: 6.45 → 6.45 (No change)
major_industry_code_ Retail trade: 2.85 → 2.85 (Improved)
major_industry_code_ Social services: 8.69 → 8.69 (No change)
major_industry_code_ Transportation: 6.41 → 6.41 (No change)
major_industry_code_ Utilities and sanitary services: 13.07 → 13.07 (Improved)
major_industry_code_ Wholesale trade: 7.08 → 7.08 (Improved)
major_occupation_code_ Adm support including clerical: 3.22 → 3.22 (Improved)
major_occupation_code_ Armed Forces: 79.64 → 79.64 (Improved)
major_occupation_code_ Executive admin and managerial: 3.57 → 3.57 (No change)
major_occupation_code_ Farming forestry and fishing: 7.76 → 7.76 (Improved)
major_occupation_code_ Handlers equip cleaners etc : 6.59 → 6.59 (Improved)
major_occupation_code_ Machine operators assmblrs & inspctrs: 5.22 → 5.22 (Improved)
major_occupation_code_ Other service: 3.54 → 3.54 (Improved)
major_occupation_code_ Precision production craft & repair: 3.89 → 3.89 (Improved)
major_occupation_code_ Private household services: 14.88 → 14.88 (Improved)
major_occupation_code_ Professional specialty: 3.37 → 3.37 (Improved)
major_occupation_code_ Protective services: 10.93 → 10.93 (No change)
major_occupation_code_ Sales: 3.63 → 3.63 (Improved)
major_occupation_code_ Technicians and related support: 7.93 → 7.93 (No change)
major_occupation_code_ Transportation and material moving: 6.67 → 6.67 (Improved)
race_ Amer Indian Aleut or Eskimo: 8.78 → 8.78 (Improved)
race_ Asian or Pacific Islander: 5.50 → 5.50 (Improved)
race_ Black: 2.61 → 2.61 (Improved)
race_ Other: 6.90 → 6.90 (No change)
race_ White: -1.80 → -1.80 (No change)
hispanic_origin_ All other: -2.04 → -2.04 (Improved)
hispanic_origin_ Central or South American: 6.72 → 6.72 (Improved)
hispanic_origin_ Chicano: 23.74 → 23.74 (Improved)
hispanic_origin_ Cuban: 12.40 → 12.40 (No change)
hispanic_origin_ Do not know: 26.11 → 26.11 (Improved)
hispanic_origin_ Mexican (Mexicano): 4.81 → 4.81 (Improved)
hispanic_origin_ Mexican-American: 4.64 → 4.64 (Improved)
hispanic_origin_ NA: 15.43 → 15.43 (Improved)
hispanic_origin_ Other Spanish: 8.65 → 8.65 (No change)
hispanic_origin_ Puerto Rican: 7.60 → 7.60 (No change)
member_of_labor_union_ No: 3.06 → 3.06 (Improved)
member_of_labor_union_ Not in universe: -2.74 → -2.74 (Improved)
member_of_labor_union_ Yes: 8.19 → 8.19 (Improved)
reason_for_unemployment_ Job leaver: 18.19 → 18.19 (Improved)
reason_for_unemployment_ Job loser - on layoff: 13.59 → 13.59 (Improved)
reason_for_unemployment_ New entrant: 21.53 → 21.53 (Improved)
reason_for_unemployment_ Not in universe: -5.25 → -5.25 (Improved)
reason_for_unemployment_ Other job loser: 9.14 → 9.14 (Improved)
reason_for_unemployment_ Re-entrant: 9.50 → 9.50 (No change)
full_or_part_time_employment_ Full-time schedules: 1.42 → 1.42 (Improved)
full_or_part_time_employment_ Not in labor force: 2.09 → 2.09 (No change)
full_or_part_time_employment_ PT for econ reasons usually FT: 19.13 → 19.13 (Improved)
full_or_part_time_employment_ PT for econ reasons usually PT: 12.81 → 12.81 (Improved)
full_or_part_time_employment_ PT for non-econ reasons usually FT: 7.28 → 7.28 (Improved)
full_or_part_time_employment_ Unemployed full-time: 8.74 → 8.74 (Improved)
full_or_part_time_employment_ Unemployed part- time: 15.19 → 15.19 (Improved)
tax_filer_status_ Head of household: 4.76 → 4.76 (Improved)
tax_filer_status_ Joint both 65+: 4.51 → 4.51 (Improved)
tax_filer_status_ Joint both under 65: 0.64 → 0.64 (Improved)
tax_filer_status_ Joint one under 65 & one 65+: 6.81 → 6.81 (No change)
tax_filer_status_ Nonfiler: 0.60 → 0.60 (Improved)
tax_filer_status_ Single: 1.54 → 1.54 (No change)
region_of_previous_residence_ Abroad: 21.69 → 21.69 (Improved)
region_of_previous_residence_ Midwest: 7.35 → 7.35 (Improved)
region_of_previous_residence_ Northeast: 8.36 → 8.36 (Improved)
region_of_previous_residence_ Not in universe: -3.10 → -3.10 (Improved)
region_of_previous_residence_ South: 6.07 → 6.07 (No change)
region_of_previous_residence_ West: 6.66 → 6.66 (Improved)
state_of_previous_residence_ ?: 16.97 → 16.97 (No change)
state_of_previous_residence_ Abroad: 18.95 → 18.95 (Improved)
state_of_previous_residence_ Alabama: 29.78 → 29.78 (No change)
state_of_previous_residence_ Alaska: 28.11 → 28.11 (No change)
state_of_previous_residence_ Arizona: 26.40 → 26.40 (No change)
state_of_previous_residence_ Arkansas: 30.96 → 30.96 (No change)
state_of_previous_residence_ California: 10.32 → 10.32 (Improved)
state_of_previous_residence_ Colorado: 29.10 → 29.10 (Improved)
state_of_previous_residence_ Connecticut: 38.23 → 38.23 (Improved)
state_of_previous_residence_ Delaware: 45.46 → 45.46 (No change)
state_of_previous_residence_ District of Columbia: 41.95 → 41.95 (Improved)
state_of_previous_residence_ Florida: 14.56 → 14.56 (Improved)
state_of_previous_residence_ Georgia: 29.23 → 29.23 (No change)
state_of_previous_residence_ Idaho: 85.55 → 85.55 (No change)
state_of_previous_residence_ Illinois: 33.42 → 33.42 (Improved)
state_of_previous_residence_ Indiana: 17.94 → 17.94 (Improved)
state_of_previous_residence_ Iowa: 38.53 → 38.53 (No change)
state_of_previous_residence_ Kansas: 36.83 → 36.83 (No change)
state_of_previous_residence_ Kentucky: 29.10 → 29.10 (Improved)
state_of_previous_residence_ Louisiana: 28.47 → 28.47 (No change)
state_of_previous_residence_ Maine: 38.53 → 38.53 (No change)
state_of_previous_residence_ Maryland: 34.67 → 34.67 (No change)
state_of_previous_residence_ Massachusetts: 36.07 → 36.07 (No change)
state_of_previous_residence_ Michigan: 21.97 → 21.97 (Improved)
state_of_previous_residence_ Minnesota: 18.23 → 18.23 (Improved)
state_of_previous_residence_ Mississippi: 34.89 → 34.89 (Improved)
state_of_previous_residence_ Missouri: 36.83 → 36.83 (No change)
state_of_previous_residence_ Montana: 30.20 → 30.20 (Improved)
state_of_previous_residence_ Nebraska: 30.96 → 30.96 (No change)
state_of_previous_residence_ Nevada: 35.12 → 35.12 (No change)
state_of_previous_residence_ New Hampshire: 29.23 → 29.23 (No change)
state_of_previous_residence_ New Jersey: 50.02 → 50.02 (Improved)
state_of_previous_residence_ New Mexico: 20.59 → 20.59 (Improved)
state_of_previous_residence_ New York: 29.78 → 29.78 (No change)
state_of_previous_residence_ North Carolina: 15.47 → 15.47 (Improved)
state_of_previous_residence_ North Dakota: 20.63 → 20.63 (No change)
state_of_previous_residence_ Not in universe: -3.10 → -3.10 (Improved)
state_of_previous_residence_ Ohio: 30.96 → 30.96 (No change)
state_of_previous_residence_ Oklahoma: 17.85 → 17.85 (No change)
state_of_previous_residence_ Oregon: 29.64 → 29.64 (No change)
state_of_previous_residence_ Pennsylvania: 31.44 → 31.44 (No change)
state_of_previous_residence_ South Carolina: 44.04 → 44.04 (No change)
state_of_previous_residence_ South Dakota: 36.07 → 36.07 (No change)
state_of_previous_residence_ Tennessee: 31.77 → 31.77 (No change)
state_of_previous_residence_ Texas: 31.77 → 31.77 (No change)
state_of_previous_residence_ Utah: 13.30 → 13.30 (Improved)
state_of_previous_residence_ Vermont: 30.96 → 30.96 (Improved)
state_of_previous_residence_ Virginia: 40.47 → 40.47 (Improved)
state_of_previous_residence_ West Virginia: 30.50 → 30.50 (No change)
state_of_previous_residence_ Wisconsin: 39.14 → 39.14 (Improved)
state_of_previous_residence_ Wyoming: 30.80 → 30.80 (Improved)
detailed_household_summary_ Child 18+ ever marr Not in a subfamily: 14.25 → 14.25 (No change)
detailed_household_summary_ Child 18+ ever marr RP of subfamily: 16.31 → 16.31 (No change)
detailed_household_summary_ Child 18+ never marr Not in a subfamily: 3.56 → 3.56 (Improved)
detailed_household_summary_ Child 18+ never marr RP of subfamily: 17.52 → 17.52 (No change)
detailed_household_summary_ Child 18+ spouse of subfamily RP: 35.12 → 35.12 (No change)
detailed_household_summary_ Child <18 ever marr RP of subfamily: 125.94 → 125.94 (Improved)
detailed_household_summary_ Child <18 ever marr not in subfamily: 79.64 → 79.64 (Improved)
detailed_household_summary_ Child <18 never marr RP of subfamily: 49.37 → 49.37 (Improved)
detailed_household_summary_ Child <18 never marr not in subfamily: 1.30 → 1.30 (No change)
detailed_household_summary_ Child <18 spouse of subfamily RP: 308.51 → 308.51 (Improved)
detailed_household_summary_ Child under 18 of RP of unrel subfamily: 16.42 → 16.42 (Improved)
detailed_household_summary_ Grandchild 18+ ever marr RP of subfamily: 125.94 → 125.94 (Improved)
detailed_household_summary_ Grandchild 18+ ever marr not in subfamily: 70.76 → 70.76 (No change)
detailed_household_summary_ Grandchild 18+ never marr RP of subfamily: 308.51 → 308.51 (Improved)
detailed_household_summary_ Grandchild 18+ never marr not in subfamily: 21.37 → 21.37 (No change)
detailed_household_summary_ Grandchild 18+ spouse of subfamily RP: 154.25 → 154.25 (Improved)
detailed_household_summary_ Grandchild <18 never marr RP of subfamily: 308.51 → 308.51 (No change)
detailed_household_summary_ Grandchild <18 never marr child of subfamily RP: 10.38 → 10.38 (Improved)
detailed_household_summary_ Grandchild <18 never marr not in subfamily: 13.74 → 13.74 (Improved)
detailed_household_summary_ Householder: 1.02 → 1.02 (No change)
detailed_household_summary_ In group quarters: 34.45 → 34.45 (No change)
detailed_household_summary_ Nonfamily householder: 2.41 → 2.41 (No change)
detailed_household_summary_ Other Rel 18+ ever marr RP of subfamily: 17.49 → 17.49 (Improved)
detailed_household_summary_ Other Rel 18+ ever marr not in subfamily: 9.54 → 9.54 (No change)
detailed_household_summary_ Other Rel 18+ never marr RP of subfamily: 38.83 → 38.83 (No change)
detailed_household_summary_ Other Rel 18+ never marr not in subfamily: 10.56 → 10.56 (Improved)
detailed_household_summary_ Other Rel 18+ spouse of subfamily RP: 16.31 → 16.31 (Improved)
detailed_household_summary_ Other Rel <18 ever marr RP of subfamily: 178.11 → 178.11 (No change)
detailed_household_summary_ Other Rel <18 ever marr not in subfamily: 218.15 → 218.15 (Improved)
detailed_household_summary_ Other Rel <18 never marr child of subfamily RP: 17.82 → 17.82 (Improved)
detailed_household_summary_ Other Rel <18 never marr not in subfamily: 17.88 → 17.88 (Improved)
detailed_household_summary_ Other Rel <18 never married RP of subfamily: 218.15 → 218.15 (Improved)
detailed_household_summary_ Other Rel <18 spouse of subfamily RP: 218.15 → 218.15 (Improved)
detailed_household_summary_ RP of unrelated subfamily: 16.67 → 16.67 (Improved)
detailed_household_summary_ Secondary individual: 5.27 → 5.27 (No change)
detailed_household_summary_ Spouse of RP of unrelated subfamily: 62.95 → 62.95 (Improved)
detailed_household_summary_ Spouse of householder: 1.37 → 1.37 (Improved)
detailed_household_summary_in_household_ Child 18 or older: 3.18 → 3.18 (No change)
detailed_household_summary_in_household_ Child under 18 ever married: 65.75 → 65.75 (Improved)
detailed_household_summary_in_household_ Child under 18 never married: 1.30 → 1.30 (Improved)
detailed_household_summary_in_household_ Group Quarters- Secondary individual: 42.34 → 42.34 (No change)
detailed_household_summary_in_household_ Nonrelative of householder: 4.69 → 4.69 (No change)
detailed_household_summary_in_household_ Other relative of householder: 4.13 → 4.13 (No change)
detailed_household_summary_in_household_ Spouse of householder: 1.37 → 1.37 (Improved)
migration_code_change_in_msa_ Abroad to MSA: 24.03 → 24.03 (No change)
migration_code_change_in_msa_ Abroad to nonMSA: 51.39 → 51.39 (Improved)
migration_code_change_in_msa_ MSA to MSA: 3.92 → 3.92 (No change)
migration_code_change_in_msa_ MSA to nonMSA: 16.12 → 16.12 (No change)
migration_code_change_in_msa_ NonMSA to MSA: 17.94 → 17.94 (Improved)
migration_code_change_in_msa_ NonMSA to nonMSA: 8.28 → 8.28 (Improved)
migration_code_change_in_msa_ Not identifiable: 21.80 → 21.80 (No change)
migration_code_change_in_msa_ Not in universe: 12.40 → 12.40 (No change)
migration_code_change_in_reg_ Abroad: 21.69 → 21.69 (Improved)
migration_code_change_in_reg_ Different county same state: 8.35 → 8.35 (Improved)
migration_code_change_in_reg_ Different division same region: 20.59 → 20.59 (Improved)
migration_code_change_in_reg_ Different region: 12.70 → 12.70 (Improved)
migration_code_change_in_reg_ Different state same division: 14.16 → 14.16 (Improved)
migration_code_change_in_reg_ Not in universe: 12.40 → 12.40 (No change)
migration_code_change_in_reg_ Same county: 4.11 → 4.11 (No change)
migration_code_move_within_reg_ Abroad: 21.69 → 21.69 (Improved)
migration_code_move_within_reg_ Different county same state: 8.35 → 8.35 (Improved)
migration_code_move_within_reg_ Different state in Midwest: 19.84 → 19.84 (Improved)
migration_code_move_within_reg_ Different state in Northeast: 21.43 → 21.43 (No change)
migration_code_move_within_reg_ Different state in South: 14.11 → 14.11 (Improved)
migration_code_move_within_reg_ Different state in West: 16.33 → 16.33 (No change)
migration_code_move_within_reg_ Not in universe: 12.40 → 12.40 (No change)
migration_code_move_within_reg_ Same county: 4.11 → 4.11 (No change)
live_in_this_house_1_year_ago_ No: 3.10 → 3.10 (Improved)
migration_prev_res_in_sunbelt_ No: 4.14 → 4.14 (No change)
migration_prev_res_in_sunbelt_ Yes: 5.52 → 5.52 (No change)
family_members_under_18_ Both parents present: 1.75 → 1.75 (No change)
family_members_under_18_ Father only present: 9.98 → 9.98 (Improved)
family_members_under_18_ Mother only present: 3.62 → 3.62 (Improved)
family_members_under_18_ Neither parent present: 10.81 → 10.81 (Improved)
family_members_under_18_ Not in universe: -1.15 → -1.15 (Improved)
country_of_birth_father_ ?: 5.04 → 5.04 (No change)
country_of_birth_father_ Cambodia: 28.59 → 28.59 (Improved)
country_of_birth_father_ Canada: 12.00 → 12.00 (Improved)
country_of_birth_father_ China: 15.51 → 15.51 (Improved)
country_of_birth_father_ Columbia: 18.03 → 18.03 (No change)
country_of_birth_father_ Cuba: 12.68 → 12.68 (No change)
country_of_birth_father_ Dominican-Republic: 11.74 → 11.74 (Improved)
country_of_birth_father_ Ecuador: 22.20 → 22.20 (No change)
country_of_birth_father_ El-Salvador: 13.72 → 13.72 (Improved)
country_of_birth_father_ England: 15.90 → 15.90 (No change)
country_of_birth_father_ France: 33.82 → 33.82 (Improved)
country_of_birth_father_ Germany: 12.11 → 12.11 (Improved)
country_of_birth_father_ Greece: 21.75 → 21.75 (Improved)
country_of_birth_father_ Guatemala: 20.36 → 20.36 (Improved)
country_of_birth_father_ Haiti: 24.72 → 24.72 (Improved)
country_of_birth_father_ Holand-Netherlands: 67.30 → 67.30 (No change)
country_of_birth_father_ Honduras: 30.06 → 30.06 (Improved)
country_of_birth_father_ Hong Kong: 44.04 → 44.04 (No change)
country_of_birth_father_ Hungary: 23.96 → 23.96 (No change)
country_of_birth_father_ India: 17.76 → 17.76 (Improved)
country_of_birth_father_ Iran: 30.35 → 30.35 (No change)
country_of_birth_father_ Ireland: 17.70 → 17.70 (Improved)
country_of_birth_father_ Italy: 9.15 → 9.15 (Improved)
country_of_birth_father_ Jamaica: 19.47 → 19.47 (No change)
country_of_birth_father_ Japan: 21.43 → 21.43 (Improved)
country_of_birth_father_ Laos: 34.45 → 34.45 (Improved)
country_of_birth_father_ Mexico: 4.04 → 4.04 (No change)
country_of_birth_father_ Nicaragua: 23.19 → 23.19 (Improved)
country_of_birth_father_ Outlying-U S (Guam USVI etc): 35.58 → 35.58 (No change)
country_of_birth_father_ Panama: 77.11 → 77.11 (Improved)
country_of_birth_father_ Peru: 24.56 → 24.56 (Improved)
country_of_birth_father_ Philippines: 12.62 → 12.62 (Improved)
country_of_birth_father_ Poland: 12.33 → 12.33 (Improved)
country_of_birth_father_ Portugal: 22.03 → 22.03 (Improved)
country_of_birth_father_ Puerto-Rico: 8.46 → 8.46 (No change)
country_of_birth_father_ Scotland: 28.59 → 28.59 (Improved)
country_of_birth_father_ South Korea: 19.28 → 19.28 (No change)
country_of_birth_father_ Taiwan: 34.03 → 34.03 (Improved)
country_of_birth_father_ Thailand: 41.95 → 41.95 (Improved)
country_of_birth_father_ Trinadad&Tobago: 37.65 → 37.65 (Improved)
country_of_birth_father_ United-States: -1.42 → -1.42 (No change)
country_of_birth_father_ Vietnam: 20.92 → 20.92 (Improved)
country_of_birth_father_ Yugoslavia: 28.11 → 28.11 (No change)
country_of_birth_mother_ ?: 5.35 → 5.35 (No change)
country_of_birth_mother_ Cambodia: 30.80 → 30.80 (Improved)
country_of_birth_mother_ Canada: 11.65 → 11.65 (No change)
country_of_birth_mother_ China: 16.10 → 16.10 (Improved)
country_of_birth_mother_ Columbia: 18.07 → 18.07 (Improved)
country_of_birth_mother_ Cuba: 12.58 → 12.58 (Improved)
country_of_birth_mother_ Dominican-Republic: 13.25 → 13.25 (Improved)
country_of_birth_mother_ Ecuador: 21.75 → 21.75 (Improved)
country_of_birth_mother_ El-Salvador: 13.26 → 13.26 (Improved)
country_of_birth_mother_ England: 14.99 → 14.99 (Improved)
country_of_birth_mother_ France: 33.03 → 33.03 (No change)
country_of_birth_mother_ Germany: 11.82 → 11.82 (No change)
country_of_birth_mother_ Greece: 24.96 → 24.96 (Improved)
country_of_birth_mother_ Guatemala: 20.40 → 20.40 (No change)
country_of_birth_mother_ Haiti: 24.33 → 24.33 (Improved)
country_of_birth_mother_ Holand-Netherlands: 68.96 → 68.96 (No change)
country_of_birth_mother_ Honduras: 29.10 → 29.10 (Improved)
country_of_birth_mother_ Hong Kong: 44.97 → 44.97 (No change)
country_of_birth_mother_ Hungary: 23.81 → 23.81 (No change)
country_of_birth_mother_ India: 17.64 → 17.64 (No change)
country_of_birth_mother_ Iran: 32.30 → 32.30 (No change)
country_of_birth_mother_ Ireland: 16.84 → 16.84 (Improved)
country_of_birth_mother_ Italy: 10.14 → 10.14 (No change)
country_of_birth_mother_ Jamaica: 19.47 → 19.47 (No change)
country_of_birth_mother_ Japan: 18.98 → 18.98 (Improved)
country_of_birth_mother_ Laos: 36.57 → 36.57 (No change)
country_of_birth_mother_ Mexico: 4.06 → 4.06 (Improved)
country_of_birth_mother_ Nicaragua: 22.74 → 22.74 (Improved)
country_of_birth_mother_ Outlying-U S (Guam USVI etc): 39.46 → 39.46 (Improved)
country_of_birth_mother_ Panama: 72.70 → 72.70 (Improved)
country_of_birth_mother_ Peru: 23.96 → 23.96 (Improved)
country_of_birth_mother_ Philippines: 12.07 → 12.07 (Improved)
country_of_birth_mother_ Poland: 12.74 → 12.74 (Improved)
country_of_birth_mother_ Portugal: 23.06 → 23.06 (Improved)
country_of_birth_mother_ Puerto-Rico: 8.79 → 8.79 (Improved)
country_of_birth_mother_ Scotland: 29.10 → 29.10 (No change)
country_of_birth_mother_ South Korea: 17.91 → 17.91 (Improved)
country_of_birth_mother_ Taiwan: 30.50 → 30.50 (Improved)
country_of_birth_mother_ Thailand: 36.83 → 36.83 (No change)
country_of_birth_mother_ Trinadad&Tobago: 41.19 → 41.19 (No change)
country_of_birth_mother_ United-States: -1.47 → -1.47 (No change)
country_of_birth_mother_ Vietnam: 20.14 → 20.14 (No change)
country_of_birth_mother_ Yugoslavia: 30.80 → 30.80 (Improved)
country_of_birth_self_ ?: 7.19 → 7.19 (Improved)
country_of_birth_self_ Cambodia: 38.83 → 38.83 (No change)
country_of_birth_self_ Canada: 16.67 → 16.67 (Improved)
country_of_birth_self_ China: 20.01 → 20.01 (Improved)
country_of_birth_self_ Columbia: 21.17 → 21.17 (Improved)
country_of_birth_self_ Cuba: 14.86 → 14.86 (Improved)
country_of_birth_self_ Dominican-Republic: 17.03 → 17.03 (No change)
country_of_birth_self_ Ecuador: 26.70 → 26.70 (Improved)
country_of_birth_self_ El-Salvador: 16.59 → 16.59 (Improved)
country_of_birth_self_ England: 20.63 → 20.63 (No change)
country_of_birth_self_ France: 40.83 → 40.83 (No change)
country_of_birth_self_ Germany: 15.06 → 15.06 (Improved)
country_of_birth_self_ Greece: 36.57 → 36.57 (Improved)
country_of_birth_self_ Guatemala: 23.96 → 23.96 (No change)
country_of_birth_self_ Haiti: 30.96 → 30.96 (No change)
country_of_birth_self_ Holand-Netherlands: 93.01 → 93.01 (Improved)
country_of_birth_self_ Honduras: 35.12 → 35.12 (No change)
country_of_birth_self_ Hong Kong: 42.75 → 42.75 (Improved)
country_of_birth_self_ Hungary: 56.30 → 56.30 (No change)
country_of_birth_self_ India: 21.43 → 21.43 (No change)
country_of_birth_self_ Iran: 37.94 → 37.94 (Improved)
country_of_birth_self_ Ireland: 36.83 → 36.83 (No change)
country_of_birth_self_ Italy: 20.54 → 20.54 (Improved)
country_of_birth_self_ Jamaica: 23.19 → 23.19 (Improved)
country_of_birth_self_ Japan: 23.06 → 23.06 (Improved)
country_of_birth_self_ Laos: 41.95 → 41.95 (Improved)
country_of_birth_self_ Mexico: 5.45 → 5.45 (Improved)
country_of_birth_self_ Nicaragua: 27.00 → 27.00 (Improved)
country_of_birth_self_ Outlying-U S (Guam USVI etc): 45.46 → 45.46 (No change)
country_of_birth_self_ Panama: 93.01 → 93.01 (Improved)
country_of_birth_self_ Peru: 28.11 → 28.11 (Improved)
country_of_birth_self_ Philippines: 14.41 → 14.41 (No change)
country_of_birth_self_ Poland: 23.60 → 23.60 (No change)
country_of_birth_self_ Portugal: 31.77 → 31.77 (Improved)
country_of_birth_self_ Puerto-Rico: 11.62 → 11.62 (Improved)
country_of_birth_self_ Scotland: 52.12 → 52.12 (No change)
country_of_birth_self_ South Korea: 20.09 → 20.09 (Improved)
country_of_birth_self_ Taiwan: 34.45 → 34.45 (Improved)
country_of_birth_self_ Thailand: 38.83 → 38.83 (No change)
country_of_birth_self_ Trinadad&Tobago: 45.46 → 45.46 (No change)
country_of_birth_self_ United-States: -2.36 → -2.36 (No change)
country_of_birth_self_ Vietnam: 22.93 → 22.93 (Improved)
country_of_birth_self_ Yugoslavia: 44.97 → 44.97 (No change)
citizenship_ Foreign born- Not a citizen of U S : 3.37 → 3.37 (Improved)
citizenship_ Foreign born- U S citizen by naturalization: 5.38 → 5.38 (No change)
citizenship_ Native- Born abroad of American Parent(s): 9.95 → 9.95 (No change)
citizenship_ Native- Born in Puerto Rico or U S Outlying: 11.24 → 11.24 (Improved)
citizenship_ Native- Born in the United States: -2.36 → -2.36 (Improved)
fill_inc_questionnaire_for_veteran_ No: 10.71 → 10.71 (No change)
fill_inc_questionnaire_for_veteran_ Not in universe: -9.58 → -9.58 (No change)
fill_inc_questionnaire_for_veteran_ Yes: 21.97 → 21.97 (No change)

4.9 Scale/Normalize¶

InĀ [69]:
# Function to describe data before and after scaling
def print_scaling_info(data_before, data_after, dataset_name=""):
    """Print statistical information about scaling effects"""
    print(f"\n{dataset_name} Scaling Results:")
    
    # Convert to numpy if needed
    if isinstance(data_after, pd.DataFrame):
        data_after = data_after.values
    if isinstance(data_before, pd.DataFrame):
        data_before = data_before.values
        
    # Calculate statistics
    before_mean = np.mean(data_before)
    before_std = np.std(data_before)
    before_min = np.min(data_before)
    before_max = np.max(data_before)
    
    after_mean = np.mean(data_after)
    after_std = np.std(data_after)
    after_min = np.min(data_after)
    after_max = np.max(data_after)
    
    # Print comparison
    print(f"  Before scaling: mean={before_mean:.4f}, std={before_std:.4f}, min={before_min:.4f}, max={before_max:.4f}")
    print(f"  After scaling:  mean={after_mean:.4f}, std={after_std:.4f}, min={after_min:.4f}, max={after_max:.4f}")

# Apply Robust Scaling
print("\nApplying RobustScaler to handle outliers and create uniform feature scales...")
scaler = RobustScaler()

# Fit scaler on training data
X_train_scaled = scaler.fit_transform(X_train_skew)

# Apply same transformation to validation and test data
X_val_scaled = scaler.transform(X_val_skew)
X_test_scaled = scaler.transform(X_test_skew)

# Print information about the scaling results
print_scaling_info(X_train_skew, X_train_scaled, "Training Data")
print_scaling_info(X_val_skew, X_val_scaled, "Validation Data")
print_scaling_info(X_test_skew, X_test_scaled, "Test Data")

# Convert back to DataFrames to preserve column names
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_skew.columns, index=X_train_skew.index)
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=X_val_skew.columns, index=X_val_skew.index) 
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_skew.columns, index=X_test_skew.index)

print(f"\nScaled dataset shapes:")
print(f"  X_train_scaled: {X_train_scaled_df.shape}")
print(f"  X_val_scaled: {X_val_scaled_df.shape}")
print(f"  X_test_scaled: {X_test_scaled_df.shape}")
Applying RobustScaler to handle outliers and create uniform feature scales...

Training Data Scaling Results:
  Before scaling: mean=0.8399, std=5.2905, min=-1.0000, max=95.0000
  After scaling:  mean=0.0302, std=0.2654, min=-5.0888, max=10.8198

Validation Data Scaling Results:
  Before scaling: mean=0.8567, std=5.4131, min=-1.0000, max=95.0000
  After scaling:  mean=0.0240, std=0.6453, min=-5.1320, max=17.5763

Test Data Scaling Results:
  Before scaling: mean=0.8842, std=5.4257, min=-1.0000, max=95.0000
  After scaling:  mean=0.0484, std=0.6541, min=-4.9534, max=17.5763

Scaled dataset shapes:
  X_train_scaled: (155305, 420)
  X_val_scaled: (38829, 420)
  X_test_scaled: (95180, 420)

4.10 Dimensionality Reduction¶

InĀ [70]:
def apply_pca(X_train_scaled, X_val_scaled, X_test_scaled, variance_threshold=0.95):
    """
    Apply PCA to scaled data and generate a scree plot
    
    Args:
        X_train_scaled, X_val_scaled, X_test_scaled: Scaled numpy arrays or DataFrames
        variance_threshold: Desired explained variance (default 0.95)
        
    Returns:
        PCA-transformed datasets
    """
    # Convert to numpy arrays if DataFrames
    if isinstance(X_train_scaled, pd.DataFrame):
        # Save the indices before conversion
        train_index = X_train_scaled.index
        val_index = X_val_scaled.index
        test_index = X_test_scaled.index
        
        # Convert to numpy for PCA
        X_train_scaled_np = X_train_scaled.values
        X_val_scaled_np = X_val_scaled.values
        X_test_scaled_np = X_test_scaled.values
    else:
        # Already numpy arrays, need to get indices from another source
        train_index = np.arange(X_train_scaled.shape[0])
        val_index = np.arange(X_val_scaled.shape[0])
        test_index = np.arange(X_test_scaled.shape[0])
        
        X_train_scaled_np = X_train_scaled
        X_val_scaled_np = X_val_scaled
        X_test_scaled_np = X_test_scaled
    
    # Fit PCA only on training data
    pca = PCA(n_components=variance_threshold)
    X_train_pca = pca.fit_transform(X_train_scaled_np)
    X_val_pca = pca.transform(X_val_scaled_np)
    X_test_pca = pca.transform(X_test_scaled_np)
    
    # Create PCA DataFrames
    pca_cols = [f'PC{i+1}' for i in range(X_train_pca.shape[1])]
    X_train_pca_df = pd.DataFrame(X_train_pca, columns=pca_cols, index=train_index)
    X_val_pca_df = pd.DataFrame(X_val_pca, columns=pca_cols, index=val_index)
    X_test_pca_df = pd.DataFrame(X_test_pca, columns=pca_cols, index=test_index)
    
    # Generate scree plot
    plt.figure(figsize=(10, 6))
    
    # Get explained variance
    explained_variance = pca.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)
    n_components = len(explained_variance)
    
    # Bar chart for individual explained variance
    plt.bar(range(1, n_components+1), explained_variance, alpha=0.7, 
            align='center', label='Individual explained variance', color='skyblue')
    
    # Line plot for cumulative explained variance
    plt.step(range(1, n_components+1), cumulative_variance, where='mid',
             label='Cumulative explained variance', color='red', linewidth=2)
    
    # Add a horizontal line at variance threshold
    plt.axhline(y=variance_threshold, color='green', linestyle='--', 
                label=f'Variance threshold: {variance_threshold}')
    
    # Annotate key points
    for i, (ev, cv) in enumerate(zip(explained_variance, cumulative_variance)):
        if i == 0 or i == n_components-1:  # Always label first and last components
            plt.text(i+1, cv+0.02, f'{cv:.2f}', ha='center', color='darkred', fontweight='bold')
        elif cv >= variance_threshold and cv-explained_variance[i] < variance_threshold:
            # Label the component that crosses the threshold
            plt.text(i+1, cv+0.02, f'{cv:.2f}', ha='center', color='darkred', fontweight='bold')
            plt.axvline(x=i+1, color='green', linestyle='--', alpha=0.3)
    
    # Add labels and title
    plt.ylabel('Explained Variance Ratio', fontsize=12)
    plt.xlabel('Principal Component', fontsize=12)
    plt.title('PCA Scree Plot: Explained Variance by Component', fontsize=14)
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    
    # Print summary statistics
    print(f"Explained variance ratio: {np.sum(pca.explained_variance_ratio_):.4f}")
    print(f"Original number of features: {X_train_scaled_np.shape[1]}")
    print(f"Features after PCA: {X_train_pca_df.shape[1]}")
    
    # Find how many components needed for threshold
    components_for_threshold = np.where(cumulative_variance >= variance_threshold)[0][0] + 1
    print(f"Components needed for {variance_threshold*100:.0f}% variance: {components_for_threshold}")
    
    # Show plot
    plt.show()
    
    return X_train_pca_df, X_val_pca_df, X_test_pca_df, pca

# To apply PCA, replace these lines after the scaling step:
# Apply PCA to the scaled data (can be commented out to skip PCA)
X_train_pca, X_val_pca, X_test_pca, pca = apply_pca(X_train_scaled_df, X_val_scaled_df, X_test_scaled_df)

# Set these variables for next steps - allowing flexibility to use or skip PCA
# If PCA is used:
X_train_processed = X_train_pca
X_val_processed = X_val_pca
X_test_processed = X_test_pca
# If PCA is skipped (commented out):
# X_train_processed = X_train_scaled_df
# X_val_processed = X_val_scaled_df
# X_test_processed = X_test_scaled_df
Explained variance ratio: 0.9501
Original number of features: 420
Features after PCA: 40
Components needed for 95% variance: 40
No description has been provided for this image

4.11 Remove Multicollinearity¶

InĀ [71]:
def remove_multicollinearity(X_train, X_val, X_test, threshold=0.75):
    """
    Remove highly correlated features
    
    This function works with either DataFrames or numpy arrays
    """
    # Convert to DataFrame if numpy arrays
    is_numpy = isinstance(X_train, np.ndarray)
    if is_numpy:
        X_train = pd.DataFrame(X_train)
        X_val = pd.DataFrame(X_val)
        X_test = pd.DataFrame(X_test)
    
    # Calculate correlation matrix
    corr_matrix = X_train.corr().abs()
    
    # Find high correlation pairs
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    high_corr_pairs = []
    
    # Detailed correlation reporting
    print("\nFeature correlations above threshold:")
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                col1 = corr_matrix.columns[i]
                col2 = corr_matrix.columns[j]
                corr_value = corr_matrix.iloc[i, j]
                high_corr_pairs.append((col1, col2, corr_value))
                print(f"• {col1} & {col2}: {corr_value:.3f}")

    # Identify columns to drop
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    
    # Remove features if any to drop
    if to_drop:
        print(f"\nDropping {len(to_drop)} features due to multicollinearity:")
        print(", ".join(to_drop))
        
        X_train = X_train.drop(to_drop, axis=1)
        X_val = X_val.drop(to_drop, axis=1)
        X_test = X_test.drop(to_drop, axis=1)
    else:
        print("\nNo features meet correlation threshold for removal")
    
    # Return in original format
    if is_numpy:
        return X_train.values, X_val.values, X_test.values
    return X_train, X_val, X_test

# To use the function with your datasets
X_train_final, X_val_final, X_test_final = remove_multicollinearity(
    X_train_processed, X_val_processed, X_test_processed, threshold=0.85
)

# print(X_train_final.shape, X_val_final.shape,X_test_final.shape)
Feature correlations above threshold:

No features meet correlation threshold for removal

5. Modeling¶

Let's implement various machine learning models and evaluate their performance.

The choice of metrics depends on whether your classification problem is binary, multi-class, or imbalanced. Here are the most common evaluation metrics to compare your models:

General Metrics for Classification¶
Metric Description Best for
Accuracy Percentage of correct predictions. Balanced datasets with equal class distribution.
Precision Measures how many predicted positives are actually positive (TP / (TP + FP)). When false positives are costly (e.g., fraud detection).
Recall (Sensitivity) Measures how many actual positives were correctly predicted (TP / (TP + FN)). When false negatives are costly (e.g., medical diagnosis).
F1-Score Harmonic mean of precision and recall. Balances both metrics. Imbalanced datasets.
ROC-AUC (Receiver Operating Characteristic - Area Under Curve) Measures how well a model distinguishes between classes. Binary classification, imbalanced datasets.
PR-AUC (Precision-Recall AUC) Measures precision vs. recall trade-off. Imbalanced datasets.
Log Loss (Cross-Entropy Loss) Measures the uncertainty of the model’s predictions. Probabilistic classification.
Computational Performance Metrics¶
Metric Description Best for
Training Time Measures how long the model takes to train. Large datasets.
Inference Time Measures how fast the model predicts new data. Real-time applications.
Model Size How much memory the model consumes. When deployment constraints exist.
How to Compare Models?¶
  1. Train all models on the same dataset.
  2. Use cross-validation** to reduce variance.
  3. Record performance metrics (F1-score, ROC-AUC, etc.).
  4. Compare training time and inference speed if necessary.
  5. Pick the best model based on the most relevant metric for your application.

5.1 Model Training and Evaluation¶

InĀ [72]:
# First, make sure you're using the updated variables consistently
print(f"X_train_final shape: {X_train_final.shape}")
print(f"y_train_out shape: {y_train_out.shape}")
print(f"X_val_final shape: {X_val_final.shape}")
print(f"y_val_out shape: {y_val_out.shape}")

# Function to train, evaluate and store model metrics
def train_and_evaluate_model(model, model_name, X_train, y_train, X_val, y_val):
    # Use stratified cross-validation for training metrics
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    # Start training timer
    start_time = time.time()
    
    # Train model
    model.fit(X_train, y_train)
    
    # Calculate training time
    training_time = time.time() - start_time
    
    # Get cross-validation scores
    if hasattr(model, "predict_proba"):
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1', n_jobs=-1)
    else:
        cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1', n_jobs=-1)
    
    # Prediction timing
    start_time = time.time()
    y_val_pred = model.predict(X_val)
    inference_time = time.time() - start_time
    
    # Get probabilities if available
    if hasattr(model, "predict_proba"):
        y_val_proba = model.predict_proba(X_val)[:, 1]
    else:
        y_val_proba = None

    # Calculate metrics
    accuracy = accuracy_score(y_val, y_val_pred)
    precision = precision_score(y_val, y_val_pred)
    recall = recall_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)
    
    # ROC-AUC only if probabilities are available
    roc_auc = roc_auc_score(y_val, y_val_proba) if y_val_proba is not None else None

    # Calculate PR-AUC and Log Loss if probabilities are available
    if y_val_proba is not None:
        precision_curve, recall_curve, _ = precision_recall_curve(y_val, y_val_proba)
        pr_auc = auc(recall_curve, precision_curve)
        log_loss_value = log_loss(y_val, y_val_proba)
    else:
        pr_auc = None
        log_loss_value = None

    # Calculate model size (with error handling)
    try:
        with open('temp_model.pkl', 'wb') as f:
            pickle.dump(model, f)
        model_size = os.path.getsize('temp_model.pkl') / (1024 * 1024)  # Convert bytes to MB
        try:
            os.remove('temp_model.pkl')
        except:
            pass  # Ignore errors in file deletion
    except:
        model_size = float('nan')  # If file operations fail

    # Store results
    results = {
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'ROC-AUC': roc_auc,
        'PR-AUC': pr_auc,
        'Log Loss': log_loss_value,
        'CV F1 (mean)': cv_scores.mean(),
        'CV F1 (std)': cv_scores.std(),
        'Training Time (s)': training_time,
        'Inference Time (s)': inference_time,
        'Model Size (MB)': model_size
    }

    return results, model

# Dictionary of models for flexible selection
models_dict = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42, n_jobs=-1),
    # 'SVM': SVC(random_state=42, probability=True),
    'KNN': KNeighborsClassifier(n_jobs=-1),
    'NaĆÆve Bayes': GaussianNB(),
    'XGBoost': xgb.XGBClassifier(random_state=42, n_jobs=-1, eval_metric='logloss'),
    'LDA': LinearDiscriminantAnalysis(),
    'MLP': MLPClassifier(random_state=42)
}

# Train and evaluate each model - EXPLICITLY use the updated y variables
results_list = []
trained_models = {}

for name, model in models_dict.items():
    print(f"Training {name}...")
    try:
        # CRITICAL: Use y_train_out and y_val_out instead of y_train and y_val
        result, trained_model = train_and_evaluate_model(
            model, name, X_train_final, y_train_out, X_val_final, y_val_out
        )
        results_list.append(result)
        trained_models[name] = trained_model
        print(f"Completed {name}")
    except Exception as e:
        print(f"Error training {name}: {str(e)}")
        # Continue with next model if one fails

# Check if we have any results before creating DataFrame
if results_list:
    # Create results dataframe
    results_df = pd.DataFrame(results_list)
    results_df.set_index('Model', inplace=True)

    # Display results with cross-validation scores
    print("Model Performance with Cross-Validation:")
    print(results_df)

    # Create a sorted version by F1 score for easy comparison
    sorted_results = results_df.sort_values('F1 Score', ascending=False)
    print("\nModels ranked by F1 Score:")
    print(sorted_results[['F1 Score', 'CV F1 (mean)', 'CV F1 (std)', 'Precision', 'Recall']])
else:
    print("No models were successfully trained. Check the error messages above.")
X_train_final shape: (155305, 40)
y_train_out shape: (155305,)
X_val_final shape: (38829, 40)
y_val_out shape: (38829,)
Training Logistic Regression...
Completed Logistic Regression
Training Decision Tree...
Completed Decision Tree
Training Random Forest...
Completed Random Forest
Training KNN...
Completed KNN
Training NaĆÆve Bayes...
Completed NaĆÆve Bayes
Training XGBoost...
Completed XGBoost
Training LDA...
Completed LDA
Training MLP...
Completed MLP
Model Performance with Cross-Validation:
                     Accuracy  Precision    Recall  F1 Score   ROC-AUC    PR-AUC   Log Loss  CV F1 (mean)  CV F1 (std)  Training Time (s)  Inference Time (s)  Model Size (MB)
Model                                                                                                                                                                         
Logistic Regression  0.932010   0.428246  0.498674  0.460784  0.913887  0.437560   0.165848      0.400421     0.011163           0.460003            0.004004         0.001332
Decision Tree        0.709856   0.099680  0.495579  0.165976  0.609345  0.312322  10.457848      0.373394     0.008861          15.315998            0.006999         0.863006
Random Forest        0.936156   0.362832  0.126879  0.188012  0.843693  0.249757   0.286078      0.416791     0.008346          18.021109            0.055038        75.958649
KNN                  0.938886   0.468268  0.362069  0.408377  0.830542  0.409469   0.647009      0.430705     0.007753           0.012998            5.441484        48.581213
NaĆÆve Bayes          0.894975   0.215361  0.303714  0.252018  0.726299  0.177344   0.951964      0.366624     0.001954           0.057999            0.024002         0.002119
XGBoost              0.909037   0.242289  0.263926  0.252645  0.842749  0.214116   0.215698      0.463628     0.006029           0.777999            0.021000         0.374887
LDA                  0.704757   0.148725  0.861185  0.253646  0.876101  0.368238   1.107817      0.476389     0.004756           0.349511            0.005999         0.002656
MLP                  0.942234   0.604396  0.024315  0.046749  0.466487  0.102347   1.561014      0.490386     0.013336          53.697011            0.022002         0.138672

Models ranked by F1 Score:
                     F1 Score  CV F1 (mean)  CV F1 (std)  Precision    Recall
Model                                                                        
Logistic Regression  0.460784      0.400421     0.011163   0.428246  0.498674
KNN                  0.408377      0.430705     0.007753   0.468268  0.362069
LDA                  0.253646      0.476389     0.004756   0.148725  0.861185
XGBoost              0.252645      0.463628     0.006029   0.242289  0.263926
NaĆÆve Bayes          0.252018      0.366624     0.001954   0.215361  0.303714
Random Forest        0.188012      0.416791     0.008346   0.362832  0.126879
Decision Tree        0.165976      0.373394     0.008861   0.099680  0.495579
MLP                  0.046749      0.490386     0.013336   0.604396  0.024315

5.2 Hyperparameter Tuning for Top Models¶

Let's tune the hyperparameters of our top performing models to improve their performance.

InĀ [73]:
# 5.3 Hyperparameter Tuning for Top Models

# Select top 3 models based on F1 score
top_models = results_df.sort_values('F1 Score', ascending=False).head(3).index.tolist()
print(f"Top 3 models for hyperparameter tuning: {top_models}")

# Hyperparameter grids for each model
param_grids = {
   'Logistic Regression': {
       'C': [0.01, 0.1, 1, 10, 100],
       'penalty': ['l2'],
       'solver': ['liblinear', 'saga']
   },
   'Random Forest': {
       'n_estimators': [100, 200, 300],
       'max_depth': [None, 10, 20, 30],
       'min_samples_split': [2, 5, 10],
       'min_samples_leaf': [1, 2, 4]
   },
   'XGBoost': {
       'n_estimators': [100, 200, 300],
       'max_depth': [3, 5, 7],
       'learning_rate': [0.01, 0.1, 0.2],
       'subsample': [0.8, 0.9, 1.0]
   },
   'SVM': {
       'C': [0.1, 1, 10],
       'gamma': ['scale', 'auto', 0.1, 0.01],
       'kernel': ['rbf', 'linear']
   },
   'KNN': {
       'n_neighbors': [3, 5, 7, 9],
       'weights': ['uniform', 'distance'],
       'p': [1, 2]
   },
   'Decision Tree': {
       'max_depth': [None, 10, 20, 30],
       'min_samples_split': [2, 5, 10],
       'min_samples_leaf': [1, 2, 4],
       'criterion': ['gini', 'entropy']
   },
   'MLP': {
       'hidden_layer_sizes': [(50,), (100,), (50, 50)],
       'activation': ['relu', 'tanh'],
       'alpha': [0.0001, 0.001, 0.01]
   },
   'LDA': {
       'solver': ['svd', 'lsqr', 'eigen'],
       'shrinkage': [None, 'auto', 0.1, 0.5, 0.9]
   },
   'NaĆÆve Bayes': {
       'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6]
   }
}

# Perform hyperparameter tuning for top models
tuned_models = {}
tuned_results = []

# Initialize the StratifiedKFold for consistent cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for model_name in top_models:
   print(f"Tuning {model_name}...")
   
   # Get base model and parameter grid
   base_model = models_dict[model_name]
   param_grid = param_grids[model_name]
   
   # Create grid search with stratified cross-validation
   grid_search = GridSearchCV(
       estimator=base_model,
       param_grid=param_grid,
       scoring='f1',
       cv=cv,  # Use stratified k-fold
       verbose=1,
       n_jobs=-1
   )
   
   # Fit grid search with the correctly processed data
   try:
       grid_search.fit(X_train_final, y_train_out)
       
       # Get best model
       best_model = grid_search.best_estimator_
       tuned_models[model_name] = best_model
       
       # Evaluate tuned model on validation set
       result, _ = train_and_evaluate_model(
           best_model, f"{model_name} (Tuned)", 
           X_train_final, y_train_out, X_val_final, y_val_out
       )
       
       tuned_results.append(result)
       
       print(f"Best parameters for {model_name}: {grid_search.best_params_}")
       print(f"Best cross-validation F1 score: {grid_search.best_score_:.4f}")
       print(f"Completed tuning {model_name}")
   
   except Exception as e:
       print(f"Error tuning {model_name}: {str(e)}")
       continue

# Check if any models were successfully tuned
if tuned_results:
    # Create tuned results dataframe
    tuned_results_df = pd.DataFrame(tuned_results)
    tuned_results_df.set_index('Model', inplace=True)
    
    print("\nTuned Model Performance:")
    print(tuned_results_df)
    
    # Compare with original models
    comparison_models = []
    for model_name in top_models:
        if model_name in results_df.index and f"{model_name} (Tuned)" in tuned_results_df.index:
            original_f1 = results_df.loc[model_name, 'F1 Score']
            tuned_f1 = tuned_results_df.loc[f"{model_name} (Tuned)", 'F1 Score']
            improvement = ((tuned_f1 - original_f1) / original_f1) * 100
            comparison_models.append({
                'Model': model_name,
                'Original F1': original_f1,
                'Tuned F1': tuned_f1,
                'Improvement (%)': improvement
            })
    
    if comparison_models:
        comparison_df = pd.DataFrame(comparison_models)
        print("\nPerformance Improvement After Tuning:")
        print(comparison_df)
else:
    print("No models were successfully tuned. Check the error messages above.")
Top 3 models for hyperparameter tuning: ['Logistic Regression', 'KNN', 'LDA']
Tuning Logistic Regression...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters for Logistic Regression: {'C': 100, 'penalty': 'l2', 'solver': 'saga'}
Best cross-validation F1 score: 0.4015
Completed tuning Logistic Regression
Tuning KNN...
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best parameters for KNN: {'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}
Best cross-validation F1 score: 0.4357
Completed tuning KNN
Tuning LDA...
Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best parameters for LDA: {'shrinkage': None, 'solver': 'svd'}
Best cross-validation F1 score: 0.4764
Completed tuning LDA

Tuned Model Performance:
                             Accuracy  Precision    Recall  F1 Score   ROC-AUC    PR-AUC  Log Loss  CV F1 (mean)  CV F1 (std)  Training Time (s)  Inference Time (s)  Model Size (MB)
Model                                                                                                                                                                                
Logistic Regression (Tuned)  0.924592   0.395085  0.554377  0.461369  0.912713  0.433808  0.179540      0.401456     0.011360           5.922000            0.003002         0.001325
KNN (Tuned)                  0.946406   0.623129  0.202476  0.305639  0.821283  0.401428  0.610199      0.435672     0.009825           0.012002           27.036997        48.581213
LDA (Tuned)                  0.704757   0.148725  0.861185  0.253646  0.876101  0.368238  1.107817      0.476389     0.004756           0.293998            0.003008         0.002656

Performance Improvement After Tuning:
                 Model  Original F1  Tuned F1  Improvement (%)
0  Logistic Regression     0.460784  0.461369         0.126814
1                  KNN     0.408377  0.305639       -25.157636
2                  LDA     0.253646  0.253646         0.000000

5.3 Build Ensemble Model¶

InĀ [74]:
# Create ensemble model using the top 3 tuned models
ensemble_models = []
for name in top_models:
    if name in tuned_models:
        ensemble_models.append((name, tuned_models[name]))

# Check if we have models to ensemble
if len(ensemble_models) >= 2:
    print(f"Building ensemble with {len(ensemble_models)} models: {[name for name, _ in ensemble_models]}")
    
    # Create and train voting classifier
    ensemble = VotingClassifier(
        estimators=ensemble_models,
        voting='soft'  # Use predicted probabilities
    )
    
    # Train and evaluate ensemble using correct datasets
    try:
        ensemble_result, ensemble_model = train_and_evaluate_model(
            ensemble, "Ensemble", 
            X_train_final, y_train_out, X_val_final, y_val_out
        )
        
        # Add ensemble result to tuned results
        ensemble_df = pd.DataFrame([ensemble_result]).set_index('Model')
        final_results = pd.concat([tuned_results_df, ensemble_df])
        
        print("\nEnsemble Model Performance:")
        print(ensemble_df)
        
        print("\nAll Models Performance (including Ensemble):")
        print(final_results.sort_values('F1 Score', ascending=False))
    
    except Exception as e:
        print(f"Error building ensemble: {str(e)}")
else:
    print(f"Not enough tuned models to build an ensemble. Need at least 2, but only have {len(ensemble_models)}.")
    # If we have tuned models, just display those
    if tuned_results:
        final_results = tuned_results_df
        print("\nTuned Models Performance:")
        print(final_results.sort_values('F1 Score', ascending=False))
Building ensemble with 3 models: ['Logistic Regression', 'KNN', 'LDA']

Ensemble Model Performance:
          Accuracy  Precision    Recall  F1 Score   ROC-AUC   PR-AUC  Log Loss  CV F1 (mean)  CV F1 (std)  Training Time (s)  Inference Time (s)  Model Size (MB)
Model                                                                                                                                                            
Ensemble  0.908573   0.348684  0.656057  0.455354  0.902643  0.46455  0.227501      0.468879     0.005031           6.343002           26.583818         97.16736

All Models Performance (including Ensemble):
                             Accuracy  Precision    Recall  F1 Score   ROC-AUC    PR-AUC  Log Loss  CV F1 (mean)  CV F1 (std)  Training Time (s)  Inference Time (s)  Model Size (MB)
Model                                                                                                                                                                                
Logistic Regression (Tuned)  0.924592   0.395085  0.554377  0.461369  0.912713  0.433808  0.179540      0.401456     0.011360           5.922000            0.003002         0.001325
Ensemble                     0.908573   0.348684  0.656057  0.455354  0.902643  0.464550  0.227501      0.468879     0.005031           6.343002           26.583818        97.167360
KNN (Tuned)                  0.946406   0.623129  0.202476  0.305639  0.821283  0.401428  0.610199      0.435672     0.009825           0.012002           27.036997        48.581213
LDA (Tuned)                  0.704757   0.148725  0.861185  0.253646  0.876101  0.368238  1.107817      0.476389     0.004756           0.293998            0.003008         0.002656

6. Final Model Evaluation¶

Let's evaluate our final models on the test set to get unbiased performance estimates.

InĀ [75]:
# Function to evaluate model on test set
def evaluate_on_test(model, model_name, X_test, y_test):
   # Make predictions
   y_test_pred = model.predict(X_test)
   
   # Get probabilities if available
   if hasattr(model, "predict_proba"):
       y_test_proba = model.predict_proba(X_test)[:, 1]
   else:
       y_test_proba = None
   
   # Calculate metrics
   accuracy = accuracy_score(y_test, y_test_pred)
   precision = precision_score(y_test, y_test_pred)
   recall = recall_score(y_test, y_test_pred)
   f1 = f1_score(y_test, y_test_pred)
   
   # ROC-AUC only if probabilities are available
   roc_auc = roc_auc_score(y_test, y_test_proba) if y_test_proba is not None else None
   
   # Calculate PR-AUC and Log Loss if probabilities are available
   if y_test_proba is not None:
       precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_test_proba)
       pr_auc = auc(recall_curve, precision_curve)
       log_loss_value = log_loss(y_test, y_test_proba)
   else:
       pr_auc = None
       log_loss_value = None
   
   # Store results
   results = {
       'Model': model_name,
       'Accuracy': accuracy,
       'Precision': precision,
       'Recall': recall,
       'F1 Score': f1,
       'ROC-AUC': roc_auc,
       'PR-AUC': pr_auc,
       'Log Loss': log_loss_value
   }
   
   return results, y_test_pred, y_test_proba

# Evaluate top models and ensemble on test set
test_results = []
test_predictions = {}
test_probabilities = {}

# Check which models we have available to evaluate
models_to_evaluate = {}

# Add tuned models
for model_name in tuned_models:
    models_to_evaluate[f"{model_name} (Tuned)"] = tuned_models[model_name]

# Add ensemble if available
if 'ensemble_model' in locals():
    models_to_evaluate["Ensemble"] = ensemble_model

print(f"Evaluating {len(models_to_evaluate)} models on the test set...")

# Evaluate each model
for model_name, model in models_to_evaluate.items():
    try:
        print(f"Evaluating {model_name}...")
        result, y_pred, y_proba = evaluate_on_test(
            model, model_name, X_test_final, y_test_out
        )
        test_results.append(result)
        test_predictions[model_name] = y_pred
        test_probabilities[model_name] = y_proba
        print(f"Completed evaluation of {model_name}")
    except Exception as e:
        print(f"Error evaluating {model_name}: {str(e)}")

# Create test results dataframe
if test_results:
    test_results_df = pd.DataFrame(test_results)
    test_results_df.set_index('Model', inplace=True)
    
    print("\nTest Set Performance:")
    print(test_results_df)
    
    # Sort by F1 score for easy comparison
    sorted_test_results = test_results_df.sort_values('F1 Score', ascending=False)
    print("\nModels ranked by Test F1 Score:")
    print(sorted_test_results[['F1 Score', 'Precision', 'Recall', 'Accuracy']])
    
    # Identify best model
    best_model_name = sorted_test_results.index[0]
    print(f"\nBest performing model on test set: {best_model_name}")
    for metric in ['F1 Score', 'Precision', 'Recall', 'Accuracy', 'ROC-AUC']:
        if metric in sorted_test_results.columns:
            print(f"{metric}: {sorted_test_results.loc[best_model_name, metric]:.4f}")
else:
    print("No models were successfully evaluated on the test set.")
Evaluating 4 models on the test set...
Evaluating Logistic Regression (Tuned)...
Completed evaluation of Logistic Regression (Tuned)
Evaluating KNN (Tuned)...
Completed evaluation of KNN (Tuned)
Evaluating LDA (Tuned)...
Completed evaluation of LDA (Tuned)
Evaluating Ensemble...
Completed evaluation of Ensemble

Test Set Performance:
                             Accuracy  Precision    Recall  F1 Score   ROC-AUC    PR-AUC  Log Loss
Model                                                                                             
Logistic Regression (Tuned)  0.943665   0.553529  0.288607  0.379398  0.908615  0.434577  0.154759
KNN (Tuned)                  0.943959   0.592197  0.195105  0.293510  0.808598  0.393574  0.675675
LDA (Tuned)                  0.729239   0.159965  0.832189  0.268347  0.872784  0.378184  0.947044
Ensemble                     0.935543   0.461732  0.484416  0.472802  0.895087  0.459025  0.204953

Models ranked by Test F1 Score:
                             F1 Score  Precision    Recall  Accuracy
Model                                                               
Ensemble                     0.472802   0.461732  0.484416  0.935543
Logistic Regression (Tuned)  0.379398   0.553529  0.288607  0.943665
KNN (Tuned)                  0.293510   0.592197  0.195105  0.943959
LDA (Tuned)                  0.268347   0.159965  0.832189  0.729239

Best performing model on test set: Ensemble
F1 Score: 0.4728
Precision: 0.4617
Recall: 0.4844
Accuracy: 0.9355
ROC-AUC: 0.8951

Let's visualize the performance of our top models on the test set.

InĀ [76]:
# Visualize test results - ROC curves
if test_results and len(test_probabilities) > 0:
    plt.figure(figsize=(10, 8))
    
    for model_name, y_pred_proba in test_probabilities.items():
        if y_pred_proba is not None:
            fpr, tpr, _ = roc_curve(y_test_out, y_pred_proba)
            roc_auc = auc(fpr, tpr)
            
            plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.3f})')
    
    plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Chance')
    plt.xlabel('False Positive Rate', fontsize=12)
    plt.ylabel('True Positive Rate', fontsize=12)
    plt.title('ROC Curves for Models on Test Set', fontsize=15)
    plt.legend(loc='lower right')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Precision-Recall curves
    plt.figure(figsize=(10, 8))
    
    for model_name, y_pred_proba in test_probabilities.items():
        if y_pred_proba is not None:
            precision, recall, _ = precision_recall_curve(y_test_out, y_pred_proba)
            pr_auc = auc(recall, precision)
            
            plt.plot(recall, precision, label=f'{model_name} (AUC = {pr_auc:.3f})')
    
    # Add baseline
    no_skill = sum(y_test_out) / len(y_test_out)
    plt.axhline(y=no_skill, linestyle='--', color='gray', label='Baseline')
    
    plt.xlabel('Recall', fontsize=12)
    plt.ylabel('Precision', fontsize=12)
    plt.title('Precision-Recall Curves for Models on Test Set', fontsize=15)
    plt.legend(loc='best')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image

7. Feature Importance Analysis¶

Let's analyze which features are most important for our best model.

InĀ [77]:
# Check if we have test_results_df
if 'test_results_df' not in locals() or test_results_df.empty:
    print("No test results available for feature importance analysis.")
else:
    # Determine best model based on test F1 score
    best_model_name = test_results_df['F1 Score'].idxmax()
    print(f"Best model based on test F1 score: {best_model_name}")
    
    # Get the best model
    best_model = None
    
    # Handle model name cases properly
    if best_model_name in models_to_evaluate:
        best_model = models_to_evaluate[best_model_name]
        print(f"Successfully retrieved model: {best_model_name}")
    else:
        print(f"Could not find model: {best_model_name}")
    
    # Only proceed if we have a model
    if best_model is not None:
        # Check if model is tree-based or has feature_importances_
        if hasattr(best_model, 'feature_importances_'):
            # Direct access for tree-based models
            importances = best_model.feature_importances_
            feature_names = X_train_final.columns
            
            # Create feature importance dataframe
            feature_importance_df = pd.DataFrame({
                'Feature': feature_names,
                'Importance': importances
            })
            
            # Sort by importance
            feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
            
            # Plot top 20 most important features
            plt.figure(figsize=(12, 10))
            top_features = feature_importance_df.head(20)
            
            ax = sns.barplot(x='Importance', y='Feature', data=top_features)
            plt.title(f'Top 20 Feature Importances for {best_model_name}', fontsize=15)
            plt.tight_layout()
            plt.show()
            
            print("Top 10 most important features:")
            print(feature_importance_df.head(10))
            
        # Handle VotingClassifier ensemble models
        elif hasattr(best_model, 'estimators'):  # Note: estimators not estimators_
            print("Best model is a VotingClassifier ensemble. Analyzing feature importance of its components.")
            
            # Access estimators differently in VotingClassifier
            for i, estimator in enumerate(best_model.estimators):
                # For VotingClassifier, estimators is a list of models, not (name, model) tuples
                if hasattr(estimator, 'feature_importances_'):
                    # Get the name based on the model type
                    name = type(estimator).__name__
                    
                    importances = estimator.feature_importances_
                    feature_names = X_train_final.columns
                    
                    # Create feature importance dataframe
                    feature_importance_df = pd.DataFrame({
                        'Feature': feature_names,
                        'Importance': importances
                    })
                    
                    # Sort by importance
                    feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
                    
                    # Plot top 20 most important features
                    plt.figure(figsize=(12, 10))
                    top_features = feature_importance_df.head(20)
                    
                    ax = sns.barplot(x='Importance', y='Feature', data=top_features)
                    plt.title(f'Top 20 Feature Importances for Ensemble Component {i+1}: {name}', fontsize=15)
                    plt.tight_layout()
                    plt.show()
                    
                    print(f"Top 10 most important features for ensemble component {i+1} ({name}):")
                    print(feature_importance_df.head(10))
                    break  # Just use the first tree-based model
                
        # For models without direct feature_importances_ 
        else:
            print(f"Model {best_model_name} does not directly expose feature importances.")
            
            try:
                # Try to get coefficients for linear models
                if hasattr(best_model, 'coef_'):
                    coef = best_model.coef_[0] if best_model.coef_.ndim > 1 else best_model.coef_
                    feature_names = X_train_final.columns
                    
                    # Create feature importance dataframe
                    feature_importance_df = pd.DataFrame({
                        'Feature': feature_names,
                        'Coefficient': coef
                    })
                    
                    # Sort by absolute coefficient value
                    feature_importance_df['Abs_Coefficient'] = feature_importance_df['Coefficient'].abs()
                    feature_importance_df = feature_importance_df.sort_values('Abs_Coefficient', ascending=False)
                    
                    # Plot top 20 most important features
                    plt.figure(figsize=(12, 10))
                    top_features = feature_importance_df.head(20)
                    
                    ax = sns.barplot(x='Coefficient', y='Feature', data=top_features)
                    plt.title(f'Top 20 Feature Coefficients for {best_model_name}', fontsize=15)
                    plt.tight_layout()
                    plt.show()
                    
                    print("Top 10 most important features by coefficient magnitude:")
                    print(feature_importance_df[['Feature', 'Coefficient']].head(10))
                else:
                    print("Consider using permutation importance or SHAP values for this model type.")
            except Exception as e:
                print(f"Error extracting feature importance: {str(e)}")
                print("Consider using permutation importance or SHAP values for this model type.")
Best model based on test F1 score: Ensemble
Successfully retrieved model: Ensemble
Best model is a VotingClassifier ensemble. Analyzing feature importance of its components.
InĀ [83]:
# Extract feature importance from the Logistic Regression tuned model
if 'Logistic Regression (Tuned)' in models_to_evaluate:
   # Get the model
   lr_model = models_to_evaluate['Logistic Regression (Tuned)']
   
   # Extract coefficients
   if hasattr(lr_model, 'coef_'):
       coef = lr_model.coef_[0] if lr_model.coef_.ndim > 1 else lr_model.coef_
       feature_names = X_train_final.columns
       
       # Create feature importance dataframe
       lr_importance_df = pd.DataFrame({
           'Feature': feature_names,
           'Coefficient': coef
       })
       
       # Sort by absolute coefficient value
       lr_importance_df['Abs_Coefficient'] = lr_importance_df['Coefficient'].abs()
       lr_importance_df = lr_importance_df.sort_values('Abs_Coefficient', ascending=False)
       
       # Plot top 20 most important features
       plt.figure(figsize=(12, 10))
       top_features = lr_importance_df.head(20)
       
       sns.barplot(x='Coefficient', y='Feature', data=top_features)
       plt.title('Top 20 Feature Coefficients for Logistic Regression (Tuned)', fontsize=15)
       plt.tight_layout()
       plt.show()
       
       print("Top 10 most important features by coefficient magnitude:")
       print(lr_importance_df[['Feature', 'Coefficient']].head(10))
   else:
       print("Logistic Regression model doesn't have coefficients attribute.")
else:
   print("Logistic Regression (Tuned) model not found in the evaluated models.")
No description has been provided for this image
Top 10 most important features by coefficient magnitude:
   Feature  Coefficient
0      PC1     1.591328
15    PC16     1.385211
14    PC15    -1.202561
19    PC20     1.013046
21    PC22    -0.975054
18    PC19     0.792207
35    PC36    -0.764262
34    PC35    -0.746540
12    PC13     0.613604
4      PC5    -0.569819